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Abstract 

In recent years, a large number of genomic and epigenomic studies have been focusing on 
the integrative analysis of multiple experimental datasets measured over a large number of 
observational units. The objectives of such studies include not only inferring a hidden state 
of activity for each unit over individual experiments, but also detecting highly associated 
clusters of units based on their inferred states. Although there are a number of methods 
tailored for specific datasets, there is currently no state-of-the-art modeling framework for this 
general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for 
State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space 
mapping and state-space clustering. In state-space mapping, it maps observations onto a 
finite state-space, representing the activation states of units across conditions. In state-space 
clustering, MBASIC incorporates a finite mixture model to cluster the units based on their 
inferred state-space profiles across all conditions. Both the state-space mapping and clustering 
can be simultaneously estimated through an Expectation-Maximization algorithm. MBASIC 
flexibly adapts to a large number of parametric distributions for the observed data, as well 
as the heterogeneity in replicate experiments. It allows for imposing structural assumptions 
on each cluster, and enables model selection using information criterion. In our data-driven 
simulation studies, MBASIC showed significant accuracy in recovering both the underlying 
state-space variables and clustering structures. We applied MBASIC to two genome research 
problems using large numbers of datasets from the ENCODE project. The first application 
grouped genes based on transcription factor occupancy profiles of their promoter regions in 
two different cell types. The second application focused on identifying groups of loci that 
are similar to a GATA2 binding site that is functional at its endogenous locus by utilizing 
transcription factor occupancy data and illustrated applicability of MBASIC in a wide variety 
of problems. In both studies, MBASIC showed higher levels of raw data fidelity than analyzing 
these data with a two-step approach using ENCODE results on transcription factor occupancy 
data. 
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tE-mail: zuo@stat.wisc.edu 
I E-mail: keles@stat.wisc.edu 


1 



1 Introduction 


The flow of genetic information through DNA transcription and RNA translation is a highly 
regulated process. The underlying mechanisms of regulation by both genomic and epigenomic 
factors are central targets in large numbers of genomic and epigenomic studies. This paper is 
motivated by a number of such studies that aim to elucidate genomic regulatory mechanisms across 
multiple biological conditions. Experiments that investigate such processes produce a plethora of 
data types. For example, relevant to DNA transcription is transcription factor occupancy data 
that quantify which regions of DNA are occupied by DNA binding proteins that can enhance or 
reduce gene expression. Histone modification data map covalent post-translational modifications to 
histone proteins, core proteins that make up the nucleosome structure of DNA. Such modifications 
impact DNA transcription by altering chromatin structure. 

Computational and statistical analysis of these data often involve identifying genomic loci 
that show significant signal, i.e., enrichment, compared to background noise in the experimental 
measurements, with the operating principle that multiple loci might exhibit similar signal profile 
over different biological conditions. 

Improvements in the next-generation sequencing technology further accelerated rapid genera¬ 
tion of these types of data. In return, the vast availability of such data has revolutionized the 
scope of genome regulation studies. Previous analyses had been restricted to detecting regions of 
genome that were associated with one particular factor in one single organism. Many recent studies 
focus on detecting more complex functional patterns that integrate data from multiple organisms 
under multiple conditions. Namely, the associations between DNA elements and how they change 
across biological and/or experimental conditions have been the focus of many integrative modeling 
approaches. Examples of these studies include: 


Differential binding analysis among multiple ChIP-seq data. One of the key mechanisms 
of gene expression regulation is through differential activities of transcription factors and epi¬ 
genetic modifications. Currently, chromatin immunoprecipitation followed by high through¬ 
put sequencing (ChIP-seq) is the state-of-the-art method for genome-wide profiling of protein- 
DNA interactions. Such two key interactions are DNA occupancy by transcription factors 
and histone modifications. Most transcription factors, i.e., DNA binding proteins, recognize 
DNA in a sequence specific manner and promote or represses gene expression. Similarly, 
histone modifications can induce diverse biological consequences such as transcriptional ac¬ 
tivation/deactivation. The study of gene regulation often involves comparing transcription 
factor occupancy and histone modifications across multiple biological conditions. Such con¬ 
ditions can be different treatment levels, time points of measurements, or different dosage 

Anders and Huber [2010 Ji et al. 


levels (Liang and Kele§ 


2012 


2013 


Wei et al. 


20151. 


Transcription factor regulatory network analysis. The combinatorial nature of transcrip¬ 
tion factor regulation underlies the large diversity observed in eukaryotic gene control. This 
largely motivates construction of regulatory networks that model gene expression as a combi¬ 
natorial function of regulatory interactions between DNA and different transcription factors. 


The large-scale data from the ENCODE project (The ENCODE Project Consortium 2012) 


now enable joint analyses of over one hundred human transcription factors across multiple cell 
types. Such analyses are posed to reveal a great amount of information about co-association 
patterns between different TFs, hierarchical network organizations, and systems-level inte¬ 


gration of complex cellular signals (Nephl et al. 2012 Gerstein et al. 2012 Cheng et al. 
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. While the large number of TFs makes it computationally formidable 
combinatorial associations for such analyses, it is important to detect 
the most significant combinatorial patterns that preserve global regulatory dynamics. 


2011 


Zeng et al. 2013 


to exhaust all possible 


Comparative functional genomic studies across different species. Functional genomics anal¬ 
ysis compares gene expressions or TF occupancy profiles between multiple species. The main 
task is to identify divergent and conserved functional modules that are central to evolution- 



expression, occupancy) among some subsets of the species under consideration. For these 
analyses, it is also important to identify functional modules that are fully divergent across 
species. These regions play an equally important role in understanding connectivity among 
species over the evolutionary history. 


Although the types of data for these different studies vary, the underlying statistical principles 
are largely shared. Therefore, we propose a unified framework for the analysis of such data by for¬ 
malizing the shared aspects. We formulate the underlying statistical problem as follows. Suppose 
a dataset {Yik} is collected over a set of observational units (e.g., loci in genomic experiments) 
i = 1, 2, • • • , I under conditions fc = 1, 2, • • • , K. Inferring the association patterns within a single 
experiment involves mapping the corresponding set of observations {Yu- ; * = 1, 2, •••,/} to a 
finite discrete state-space, = {1,2, ••• ,S}. This space contains different levels of association 
(e.g., enrichment/non-enrichment indicating the status of occupancy in ChIP-seq experiments, ex¬ 
pressed/not expressed in RNA-seq gene expression experiments). This falls under the classical 
finite-mixture modeling framework, where a latent state variable 9^^ € JY is inferred for each ob¬ 
servational unit Yi^.. A higher level of modeling on the matrix 0 = (9ik)i<i</^i<k<K is required for 
integrating the association patterns under different conditions. We call this matrix the state-space 
matrix since it describes the latent states of individual observations. 

We propose the following framework to model the state-space matrix 0. We assume that rows 
of 0 can be partitioned into J -I- 1 subsets: (1, • • • , /} = U U • • • U toj. Rows of 0 within 
partition j > 1, are generated by the same distribution parametrized by Wj. = {wjk)i<k<K- 

9tk ^ gi-\wjk), 


while the rows of ‘^o, which denotes the group of ’’singleton” units, i.e., units that do not cluster 
in any of the J groups, are generated by row specific distributions. The goal of this model is thus 
to estimate a partitioning that best characterizes the row associations of state-space matrix 0. 

We refer to the proposed framework as Matrix Based Analysis for State-space Inference and 
Clustering (MB ASIC). MB ASIC is related to classical factor analysis which considers the problem 
of projecting one dimension (either row or column) of large noisy matrices into low-dimensional 
spaces. MBASIC has two distinguished features compared to the existing literature in these areas. 
First, MBASIC deals with matrices with discrete entries, while most existing methods are designed 
for matrices on continuous scales. Second, MBASIC estimates the low-dimensional projection by 
grouping the rows of the original matrix in contrast to the Principle Component Analysis (PCA) 
approaches which form linear combinations of the rows (e.g., Ji et al. (20131, Lee et al. (20101). 
This is motivated by the following arguments: 


1. In MBASIC, each factor estimate Wj. characterizes the commonality of a group of rows and 
is easily interpretable in practice. Such interpretability can further be enhanced by imposing 
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structural restrictions on the Wj. vector for practical purposes. Examples of such constraints 


are described in Section 3.3 


2. PC A for high dimensional matrices are often accompanied by regularization techniques, which 
are computationally prohibitive for many epigenetic datasets. In contrast, clustering the 
matrix rows can be implemented very efficiently and in a straightforward manner. 


The hierarchical structure of MBASIC is similar to two other recently proposed statistical 


models: iASeq (Wei et al. 20121 and Cormotif (Wei et al. 2015). Both these models incorporate 


a state-space clustering structure similar to MBASIC. MBASIC extends these models in several 
critically essential directions. First, MBASIC is developed for general purposes and can be easily 
implemented for a wide range of parametric distributions, while Cormotif and iASeq operate with 
specific distributions targeting the problems of differential expression and allele-specific binding. 
Second, neither of these models include a group of singletons with idiosyncratic state-space pro¬ 
files. When we are agnostic about the “true” clustering structure in applications, separating the 
singletons can reduce their influence on the estimation of clustering parameters. Third, both iASeq 
and Cormotif separate estimation for the distributional parameters from the clustering structure, 
while MBASIC jointly fits all model parameters. A limiting assumption of MBASIC compared to 
these models is that MBASIC does not allow the distributional parameters within the same state 
to be heterogeneous. However, a pre-processing step that accounts for the the heterogeneity can 
overcome such a limitation. We evaluate and discuss all of these features with extensive simulation 
studies in this paper. 

This paper is organized as follows. We start with a formal description of MBASIC in Section 
followed by model estimation and selection methods in Section We also investigate general 
features of MBASIC compared to iASeq and Cormotif with extensive simulations in this section. 
Section 1^ presents results from several real data examples. Mathematical details of the algorithm 
are included in Appendix |A.1[ 


2 The Hierarchical Mixture Model Framework 


Consider a dataset with observations from / different observational units under K different con¬ 
ditions. For each condition k G {1,2, ••• ,K}, there are Uk replicate experiments, indexed by 
I = 1, 2, • • • , rifc. We use Yiki to denote the observation for the Tth replicate of unit i under condi¬ 
tion k. For each condition k at unit i, there exists a hidden state variable Oik G ■Y’ = (1, 2, • • • , S'}. 
The MBASIC model consists of the following components: 

1. State-space Mapping: 


Yikl \0ik 




1 ^kls-: ^ikls) • 


( 1 ) 


2. State-space Clustering: OikS are independently sampled from 5^ with the sampling prob¬ 
ability: 

,7 

P{Sik = s)= CPis + (1 - C) X! ( 2 ) 

i=i 
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In Q, ^kis and ums are the parameters related to the mean and dispersion for the s-th state 
for replicate I under condition k, and jikis is the covariate encoding known information for unit i. 
In ([^, Pis, C, TTj, and Wjks are additional non-negative parameters subject to restrictions: 

.IS S 

0<C<1; = '^Wjks = ^p^s = l,Vh 

j — 1 S = 1 S = 1 


We further discuss these parameters in Section 2.2 


2.1 State-space Mapping 

Equation partitions observational units i = I, • • • , I into S subsets according to their hidden 
states. Within the same replicate, data from the same hidden state follow the same distribution 
fsi'lfikis, (ikis, likis)- MBASIC assumes that the hidden states 9ikS are independent of the replicate 
index I, which means all replicates under the same condition have the same set of hidden states. 
However, distributional parameters for a given state can be different among replicates. Such a 
setting allows for the flexibility of modeling the heterogeneity in replicate experiments. 

The density function / can be from an arbitrary parametric distribution. We consider three 
fundamental families of distributions commonly used for genomic data analysis: 


• Log-normal Distribution. LN{^kisJikis,o'kis) with a density function: 


fsiylilkls 1 ^kls } tikis') — 




■ exp 


{logjy + 1) - ykislikisY 


• Negative Binomial Distribution. NB{y,kislikisT iikis) with a density function: 


Is{.y\tlkls ^ Hkls: tikis') 


^{y + cikis) {msiikis)'^crlf‘‘ 
r(crfcis)r(y) [ykislikis + 


(3) 

(4) 


• Binomial Distribution. Binom{'jikisT fJ-kis) with a density function: 


fsi.ylk'kls} 'Jikls'} 



tikis) 


likis-y 


(5) 


In these three examples, 'tiUs represents the known heterogeneity across loci whereas yus and 
Gkis are unknown parameters. For example, when using Eqn. 0 or 0 in a ChIP-seq analysis 
with S = 2 states, we can estimate ^ikii using data from the control samples so that the ChIP 


sample read counts in the background state scale with the control sample data (e.g., as in Zuo 
and Kele§ (2013)), and assume ^iki 2 = 1 for the enriched states. Eqn. ([^ can be used to analyze 


allele-specific binding data, where "fikis is the total read counts from both paternal and maternal 
alleles and is constant across s. Application with the binomial distribution also requires that 
t^kis X]i=i likis, Vfc, I, is strictly increasing in s for model identification. 

The MBASIC can be easily extended to other classes of parametric distributions and estimation 
for these distributions follows the same Expectation-Maximization skeleton. While Section [^relies 
on these three distributions to describe the model and the estimation algorithms, the second 
real data example in Section utilizes a more complex parametrization, which demonstrates the 
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wide applicability of the MBASIC framework. Furthermore, we consider the following degenerate 
distribution: 

fs{y\ms,crkis,jikis) = i{y = s), (6) 

where I(.) denotes the indicator function. This degenerate form corresponds to the situation 
where the states, 0ikS,^ are directly observed rather than inferred from Y^s. We utilize this 
parametrization for comparing MBASIC to alternative two-step analysis approaches in Section [3.5| 
Parameter estimation for this case follows a slightly modified procedure from the non-degenerate 
cases, which is described in Section]^ 


2.2 State-space Clustering 

Equation ii) models the distribution of Oik as a mixture of multiple distributions. To illustrate 
this model we introduce additional variables. The goal is to identify J clusters from the set 
of observation units !<*</. Let bi = /(unit i does not belong to any cluster) and Zij = 
/(unit i belongs to cluster j). The hi variables entertain the possibility that some observations are 
’’singletons”, i.e., they do not cluster with any other observational units. With these additional 
variables, the distribution in Equation can be hierarchically decomposed as follows: 

• bi''^'Bernoulli{Q; 

• {z^i, , Zijy'''^'MultiNom{l,{Tri, 1^2,-■■ , ttj )); 

• Conditional on bi and Zij, life’s are independent samples from with sampling probabilities 
P{0ik — s\bi — 1) — Pisi P{Pik — ^\bi — 0, Zij — 1) — Wjks- 

In this set up, although the singleton state-space probabilities pis are assumed to be con¬ 
stant across conditions, i.e., P{0ik = s) = pia, Vfc, this assumption is mildly restrictive since 
it accommodates {P{0ik = I, ■ • • ,P(0ik = S)) to follow an arbitrary prior distribution (e.g., 
{P{0ik = I, • • • J P{0ik = S)) ^ Dirichlet{a, • • • , a), V/c) as long as it leads to the same marginal 
distribution for Oik for all k. 

It is worth noting that this hierarchical structure essentially seeks a low-rank representa¬ 
tion for the matrix 0 = {0ik)i<i<i,i<k<K ■ To illustrate this, we introduce additional matrices 
0s — {^{Oik — ITs — {'^jks')i<j<jp<k<K: Y — )i<z</,i<j<J and vectors 

Ps = {,Pis)i<i<i, B = (&i)i<i</- Then, the conditional expectation of 0^ is: 


E{Qa\Z,B) = {ZWa)o{{l-B)ll) + {paOB)l 


T 


(7) 


where “o” denotes the Hadamard product. We note that E{Qa\Z, B) is a matrix of rank J-|- I, 
which is usually much smaller than the dimension of the matrix 0s. Similar models for low-rank 
representation of discrete matrices were considered in Lee et al. ( 2010| , and turned out to be 
challenging both theoretically and computationally. The row-clustering structure for the matrices 
E{Qs\Z^B) in MBASIC is more restrictive than the general low-rank structure. Such additional 
restrictions not only reduce the difficulty in parameter estimation but also enable the flexibility in 
many useful ways. For example, while Lee et al. (2010) can only estimate one matrix at a time 
and thus is only applicable when S = 2, MBASIC can be applied to arbitrary values of S. 
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3 Model Estimation and Selection 

3.1 Likelihood Functions 


In the MB ASIC model, the likelihood function for both the observed random variables Yiki^s and 
the unobserved 9ik’s, z^’s, bi’s, i.e., full data likelihood, is given by: 

l{fi,(T,TT,p,C,w;y,e,z,b) = 


ric*-(in n np'f-* ■ n n»‘" 


2=1 

IKS 


nnn 


2=1 k—1s—1 

' rik 

f sijjikl \^kls ^kls-i^ikls) 

J-1 


2 = 1 J = 1 

I(6ik=s) I J K S 

nnnn 


I{Oik=s)(l-bi)zij 

Wjks 


2=1 j—lk—1 S=1 


( 8 ) 


For non-degenerate distributions, we can show that the marginal likelihood is: 


K 


l{n, a, 7r,_p, C, w; y) = < dt 


2=1 k—1 


S rik 


^ ^ Pis fs^yikllpkls^ Hkls') tikis') 


.s=l Z=1 


J K 


(i-o^tt, n 


^ ^ Kijks fs i^Vikl Ipkls : ^kls-i 'Jikls ) 


j—1 k=l L-s=l Z=1 


(9) 


Equation ([^ is easily interpretable. Conditional on b^ and Zij, the joint distribution for each 
Yikh 1 < ^ < rife is a mixture of S components, where the weight on the s-th component is either 
Pis (when bi = 1) or wjks (when bi = 0 and = 1). This yields the expressions in the square 
brackets. Integrating out bi and Zij, the joint distribution for Yiki of fixed j is a mixture of J + 1 
components, with probability ^ of being a singleton and probability (I — (^)TTj of belonging to 
cluster j. 

For the degenerate case, by substituting (§ into it can be shown that the marginal likeli¬ 
hood is: 


K s 


i{p,,a,TT,px,w,e )=n b n 

i=l I fc=l s=l 


Iieik=s) 


K S 


j—1 k—1s—1 


i{e.k=s) 

T'fes 


( 10 ) 


3.2 An Expectation and Maximization (E-M) Algorithm 

The hierarchical structure of MBASIC naturally fits in the Expectation-Maximization algorithm 


Dempster et al. (19771, which maximizes the marginal likelihood (equations ([^ or @) by itera¬ 


tively maximizing the complete data log-likelihood function. We let (j) to denote a vector including 
all unknown parameters p, a, tt, p, w, and to denote the parameter estimates at the t-th 
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iteration. The complete data log-likelihood function is: 


IKS 




E[i{e,k = 


( 11 ) 


^ fs {,yikl \ykls ? ^kls ; ^ikls) 

i^i k^i s^i L/^1 

IKS I J 

+EEE log Pis E[I {9ik = s)bi\(j)^* EE \ogTrjE[zij{l - 

i—1 k—1 s—1 i—1 j — 1 

I 

+ ^{logCi?[6.|<^(‘-i)] + log(l - 0(1 - E[b0*-^'>])} 

I K J S 

+EEEE E[I{9ik = s)z^j(l - bi)\^^* logWjks- 

i—1 k—1 j — 1 s—1 

The E-M algorithm for MBASIC is outlined by Algorithm E-step updates are listed in 
equations (12)-(15) and their derivations are provided in Appendix [A| 


Algorithm 1 Expectation-Maximization (EM) 
for t = 1, 2, • • • until convergence do 

Expectation Step: Compute the conditional expectations E[I{9ik = E\bi\4>^'^~^'>], 

E[I{9,k = s)bS^^-% E[z,jil - E[I{9,k = s)z,,il - 

Maximization Step: Update estimates for parameters pku, <^kis, C) Wjks, Pis as maximizers 
for . 

end for 


Eib0^-^j = 


At-i) rr^ At-i)At-i) 

s ilk^l \Z^S^1 Jiks ^ 


( 12 ) 


(1 - E/=i nf=i (eLi nf=i (Ef=i 


E{z,,{l-b,)\^^^-^j = 


''uti (eE/, 


dt-i) ■ 


(t-i) .(t-i) 
= 1 iks jks 


= l ll/c=l yi^s^l Jiks ^jks J 


[1-E(&,|0*-'))], (13) 


EiIi9,k = s)z,,il-b,M^^-^j = 


[i-£;(6,|0‘-'))] 


’Ti 'nE(Eti/, 


dt-i) ■ 


= 1 Jiks ^jks 


Jiks '^jks 


-(t-l)-prif f^S ?(«-!) 

Ej = l O llfc^l (^Es=l /ifcs ^jks J Es=l/ifcs ^jks 


E{I{9,k = s)h0^-^j = E{b0^0 


J iks J^is 


L-^s—1 iks 


(t-i) (t_i) > 


(14) 


(15) 

















= nr=i fiyiki\fi'kis^\^kis ^\ltkis)- Given these results from the E-step, updates 
of C, TTj, Wjks, Pis in the M-step are straight forward as in equations and 


^(t) = 

EUE[h\p-^)] 

I 

(16) 



(17) 



pt) 

Pis = 

j:tiEmk = s)b0^-^)] 

(18) 

ELlEtlE[Iie^k = s)bS^^-^'>]’ 

.(t) 

j:l^,E[i{e,k = s)z,j{i-b,)\^(*-^)] 

(19) 

^jks 

J2s=i ELi E[I{e,k = s)zij{l - b,)\$(*^-^'>] 


Updates for ^kis and <Tkis have to be treated according to the specific distributions. For the 
log-normal distributions ([^, we have: 


.(t) ^ SLi log(2/»fci + l)P[0^k = 

l:Lll^klsP\e^k = s\i^^-^)] ’ 

.(t)2 ^ SLl P[^ik = S|^^*~^^][l0g(y»fci + 1) - PusliklsY 

For the binomial distributions ([^, we have: 


( 20 ) 

( 21 ) 


ut) ^ Y.l=ly^klP[o^k = 

l:Lll^klsP[0^k = s\^^^-^)] 


( 22 ) 


Closed form maximizers of p, and cr do not exist for the negative binomial distribution Q. We 
adopt the method of moment estimates as in Kuan et al. (2011); Zuo and Kele§ (2013), where the 
updated values and are the solutions of the following equations: 


I 

E 


-(t)2 2 
^kls '^ikls 


yfls'^lrklsP[dik = 1 )] 

2=1 



+ y-kLi^ki 


s 


P[9,k = 


'^yikiPlOik = 

2 = 1 

'^yLP[0ik = 


For the degenerate distributions as in (§, Oik’s are directly observed. Therefore, the E-M 
algorithm for this case requires slight modifications: we skip the estimation for E[I{6ik = s)|(^*-‘“^^] 
in the E-step and for p, a in the M-step. 


3.3 Estimating Structured Clusters 

In integrative functional genomics studies, the set of experimental conditions usually consists of 
interactions of multiple experimental factors; hence, it is often important to identify clusters. 
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states of which are homogeneous across the levels of one or more factors. For example, in a 
typical transcription factor network analysis, experimental conditions include the combination 
of different cell types and TFs. It is often desirable to separate loci groups whose states are 
homogeneous within each cell type from those with cell type specific states for the purpose of cell 
type comparison. Depending on the cell types involved, such comparison can yield insights on 
cell development, pathology and/or cell-specihc functions. We refer to clusters with homogeneous 
states within each cell type as TF-homogeneous. Another example is encountered in comparative 
functional genomics studies across different species, where experimental conditions range across 
both species and TFs. Clusters of loci, states of which are homogeneous across species conditional 
on each TF, constitute conserved functional modules. The TF-homogeneous clusters in this context 
represent the marginal effect of the species factor, and play a central role in understanding the 
evolutionary relationships. 

To estimate a cluster with homogeneity for a particular experimental factor, MBASIC allows 
structural constraints on its state-space parameters. Recall that the parameters of cluster j are 
represented by Wj,s = {wjis,Wj 2 sT ■ ■ ■ ,WjKs)- Marginalizing the effect of this factor, the K ex¬ 
perimental conditions can be partitioned into M sets, {1, 2, • • • , AT} = Ti U r 2 U • • • U Tm, where 
conditions within each set differ only in the levels of this factor. The parameters of this cluster 
satisfy the following constraints: 


Wjkis=Wjk 2 s, if 3 m s.t. fci,fc 2 GT, 


(23) 


Cell Type Levels: 
TF Levels: 

TF-homogeneous: 

Cell Type-homogeneous: 


Gml2878 

K562 

Atf3 

Ctcf 

Gatal 

Atf3 

Ctcf 

Gatal 


Wjls Wj2s 

Wj3s 

Wj4^s 

^j5s 

“^jGs 


Wjls 

Wj2s 

Wj3s 

Wj4s 

'^jbs 

'^jGs 


Figure 1: A graphical description for a parametrization with structural constraints. Interactions 
of 2 cell types and 3 TFs result in six experimental conditions. Parameters with homogeneous 
values are shaded by the same color. 


A pictorial depiction with six experimental conditions due to full interaction between 2 cell 
types and 3 TFs is depicted in Figure Estimating structured clustering models follows the 
previous E-M algorithm with a modification in Equation (191. A constrained maximizer for Wjks 


subject to constraint (23) is computed as: 


.(t) 

^3ks = 


*{Tra\ Ef=l ELi ’ 


k G Tm- 


MBASIC requires that such structural constraints must be specihed a priori and remain fixed 
during model fitting. MBASIC incorporates a model selection procedure to compare models with 
different hypothesized structural constraints and numbers of clusters. We next describe the details 
of this model selection procedure. 
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Table 1: Design of the simulation studies. S: size of the state-space; Proportion of singletons; 


I: number of units; J: number of clusters; K: number of experimental conditions. 


Study 

Distribution 

S 

c 

I 

iJ,K) 

Model 

Selec¬ 

tion 

1 

LN, NB, Bin 

2, 3,4 

0 , 0 . 1 , 0.4 

4000 

(20, 30) 

No 

2 

LN, NB, Bin 

2 

0 . 1 , 0.4 

4000 

(20, 30) 

Yes 

3 

iASeq 

3 

0 , 0 . 1 , 0.4 

4000 

(10, 20), (20, 30) 

Yes 

4 

Cormotif 

2 

0 

10,000 

(4, 4), (5, 8), (5, 10) 

Yes 

5 

Cormotif 

2 

0 , 0 . 1 , 0.4 

4000 

(10, 20) 

Yes 

6 

LN 

2 

0 , 0.33 

4120, 4600, 6120 

(8, 30) 

Yes 


3.4 Model Selection 

The MBASIC framework so far assumes that the total number of clusters J is known a priori. In 
practice, models with varying values of J need to be fitted independently and compared with each 
other according to some information criterion to determine the best value of J. Since the E-M 
algorithm aims to maximize the data likelihood function, AIC and BIG criteria can be utilized 
with MBASIC. The degrees of freedom for a model with J clusters is df = FiSJ2k=i n-k + {S — 
1)1 + J + F 2 , where Fi = 2 for distributions (§ and @ ,Fi = l for , and F 2 is the total number 
of free variables among Wjks’s. If there are no structured clusters, we have F 2 = JK{S — 1). 

When there is no prior information available, both the total number of clusters and the number 
of clusters following structural constraints have to be determined. This results in a prohibitively 
large number of candidate models, and computing the information criterion for each of them is 
not practical. In such cases we incorporate the following two-phase strategy to limit the number 
of candidate models: 

1. Evaluate models with varying total number of clusters without any structural constraints. 
Select Jopt according to the minimal AIC or BIC value. 

2. Evaluate models with the fixed number of Jopt clusters while varying the number of clusters 
following each structural constraint. Select the number of clusters following each structural 
constraint based on the minimal AIC or BIC value. 

We acknowledge that the above two-step strategy is only a practical compromise to restrict 
the space of candidate models and does not guarantee finding the best model that globally mini¬ 
mizes the information criterion. However, we have conducted extensive simulation studies which 
illustrated that the proposed two-phase strategy performs well in a wide variety of settings. 

3.5 Simulation Studies 

We conducted 6 model-based simulation studies to investigate the performance of MBASIC in 
various settings as summarized in Table Each simulation study has multiple settings that vary 
the distributional assumptions, size of the state-space (5), proportion of singletons {Q, number of 
units (/), number of clusters (J), and number of conditions (AT). We provide the details of these 
simulation studies in Appendix and highlight the overall conclusions in this section. 
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Data in Simulation Studies 1-2 were simulated according to MB ASIC’s distributional assump¬ 
tions. In Simulation Study I, we emphasized two most important features of MBASIC: the joint 
estimation procedure of all model parameters and the inclusion of a singleton cluster. We derived 
six alternative algorithms (Table B.l) to benchmark MBASIC’s performance in various settings. 
Three of the algorithms (SE-HC, SE-MC, PE-MC) use two-stage procedures for model estimation, 
decoupling either the estimation of the state-space variables or the distributional parameters from 
the mixture modeling of clustering analysis. The other three algorithms are created as variations 
on these by excluding the singleton feature (SE-MCO, PE-MCO, MBASICO). These benchmark 
algorithms are in spirit analogous to procedures in many applied genomic data analyses where the 
association between observational units are estimated separately from the estimation of individual 
data set specific parameters (e.g., Gerstein et al. (20121, Wei et al. (2012), Wei et al. (2015)). 

Figures B.2||B.4 summarize the performance comparisons in Simulation Study 1. We observed 
that MBASIC’s joint estimation feature improved the inference for both the clustering structure 
and the individual states. In the presence of many singletons, the inclusion of their idiosyncratic 
state-space profiles was essential for robust clustering. In Simulation Study 2, we evaluated the 
effect of using BIC to select the number of clusters as well as the structural constraints within each 
cluster. Tables B.2 and B.3 indicate that MBASIC was always able to select models with similar 
structures with the simulated truth. 

In Simulation Studies 3 to 5, we simulated data according to the models proposed by iASeq 
(Wei et al. 2012) and Cormotif (Wei et al. 2015). These models allow heterogeneous distributional 
parameters within the same state, a potential advantage over MBASIC in specific data analysis 
such as differential expression or allele-specific binding. Comparison to these two models is intended 
to enable investigation of whether MBASIC is robust against such within-state heterogeneity. In 
Simulation Study 3, we showed that MBASIC with the binomial distribution could directly handle 
data generated under the iASeq framework and achieve competitive performance (Figure [B.5[ ). In 
Simulation Study 4, we inherited the simulation settings from |Wei et al. (2015), where distributions 
from different states were weakly separable, but the individual states were completely deterministic 
from the clustering. We explored more dynamic settings in Simulation Study 5, where we had easier 
separation between different states, but randomness among the states within the same cluster. We 
showed that a pre-processing step homogenizing the within-state units followed by MBASIC leads 
to comparable performance to Cormotif in Simulation Study 4 (Figure B.6), and much better 
performance in Simulation Study 5 (Figure |B.7| ). 


Wei et al. (2015) discusses an interesting point that when the clustering model does not accom¬ 


modate singletons, small clusters tend to be merged together to form spurious clusters, estimated 
state-space patterns of which are the averages among several true clusters. In order to investi¬ 
gate whether such a phenomena exists for MBASIC, we conducted Simulation Study 6, where we 
simulated data with two large clusters and six small clusters, and compared the performance of 
MBASIC and MBASICO to highlight the effect of including a singleton cluster. We found that 
compared to MBASICO, MBASIC was significantly less aggressive in merging small clusters. Over¬ 
all, it captured large clusters and allocated the small cluster units as singletons (Figures B.IO and 


B.ll Tables B.6 B.7 and B.8). This study highlighted the utility of a singleton cluster as a 


potential remedy for merging of small clusters. 

Combining results from all of our simulation studies, we conclude that MBASIC is a powerful 
model for both state-space estimation and clustering structure recovery. Its adaptability to single- 
tons, effectiveness in model selection, and robustness against within-state heterogeneity strongly 
support its applicability for real data sets. 
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4 Applications of MBASIC to Genome Research Problems 


4.1 Transcription Factor Enrichment Network 

Regulation of gene expression relies heavily on the context-specific combinatorial activities of TFs. 
Gene clustering analysis based on TF occupancy data, i.e., ChIP-seq, aims to identify combinatorial 
patterns of TF occupancy and group genes based on such patterns. The ENCODE consortium 


(The ENCODE Project Consortium 2012) has generated TF ChIP-seq datasets for over 100 TFs 


across multiple cell types, and has motivated several integrative studies for learning regulation 
patterns (Gerstein et al. 2012 |Wang et al. 2012). In this study, we applied MBASIC to the 
analysis of such data. Specifically, we focused on the TF enrichment patterns at the promoter 
regions, i.e., -5000 bps and -1-1000 bps the transcription start site, of the 10290 genes that had 
significant expression, as measured by RNA-seq, in either the Gml2878 or the K562 cells. The 
input data to MBASIC were the mapped numbers of reads at these promoter regions from the 


uniformly processed ChIP-seq data by Gerstein et al. (2012). We chose the cell types Gml2878 


and K562 because they had the largest numbers of TF ChIP-seq experiments. The hnal dataset 
utilized included ChIP-seq data for I = 10290 observational units over 30 TFs corresponding to 
K = 60 experimental conditions (cell type x TF) with a total of 166 replicate experiments. 

We fitted MBASIC with S = 2 states and used log normal distributions as in Equation (§. 
s = 1 corresponded to the unenriched state, and we let 'yikii = log(l + Xik), where Xik is the count 
from the matching control experiment at unit i. s = 2 corresponded to the enrichment state, and 
we let 7 ifeZ 2 = 1 for all loci. 


(a) 


(b) 
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Eigure 2: (a) BIC and (b) log-likelihood values for models with different structures. All the clusters 
are unstructured in the Phase 1 models and the x-axis denotes the total number of clusters. 
The total number of clusters is 24 for Phase 2 models and the x-axis denotes the number of 
unconstrained clusters. The remaining clusters have TE-homogeneity. 

We followed the two-phase procedure using BIC from Section |3.4| to select both the number 
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Figure 3: (a) Normalized data for each cell-TF combination at five sub-sampled loci within each 
cluster, (b) Estimated enrichment probability at each cell-TF combination for each cluster. 
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of clusters and the structure of each cluster. In Phase 1, we selected the number of clusters as 
24. In Phase 2, we considered two types of structural constraints for each cluster, referred to by 
TF-homogeneity and cell type-homogeneity and defined as Wjkis = Wjk 2 s if and k 2 corresponded 
to the same TF or cell type. We found that imposing cell type-homogeneity to any cluster would 
cause that cluster to be degenerate (i.e., no unit was assigned to that cluster). Therefore, we 
chose the final model among those with TF-homogeneity structures. The BIC and log likelihood 
values for different models fitted in both phases are shown in Figure The final model had 24 
unconstrained clusters, consisting of 1 — ^ = 89.8% of the 10290 loci. The ranges of the estimated 
distribution parameters among replicates within the same cell type-TF combination is shown in 
Figure |C.12[ We notice that these parameters can be substantially different among replicated 
experiments. This provides further support for our replicate specific parametrization. 

To compare the normalized data and the predicted enrichment probability for each cluster, we 
computed the normalized signalsj^and compared them to the estimated cluster parameters. Figure 
depicts such normalized signals from five randomly selected loci within each predicted cluster 
(Figure §a)), as well as the predicted enrichment probabilities at the corresponding condition 
and cluster {wjk 2 's) (Figure|^b)). We observe that the estimated enrichment probabilities at the 
cluster level capture the commonality among loci within each cluster. In addition, each loci cluster 
exhibits distinct combinatorial patterns of activity across all cell type-TF combination. The cell 
type-TF combination enriched within each cluster is listed in Table 

Our clustering results are consistent with the existing literature on the TF enrichment networks. 
For example, cooperating TFs tend to be enriched at the same loci. This pattern can be observed in 
Figure [^b) between Bcl3 and Bclafl. Pol2 and Pol24h8 represent Pol2 experiments with different 
antibodies. As expected, we observe enrichment at the same loci for these two different version of 
Pol2 experiments. Moreover, pairs of TFs that have similar binding motifs have similar enrichment 
probabilities over the clusters. For example, Wang et al. (2012) discovered the UAl motif as 


common to both Chd2 and Etsl and the USF motif for Max, Usfl, and Usf2. Interactions between 


Tafl and Tbp have also been studied by Anandapadamanaban et al. (2013). Similar enrichment 
probabilities of these TFs across clusters can be observed in Figure [^b). In addition to these 
observations that are consistent with the literature, our results illustrate how the genome-wide 
TF association patterns can be attributed to specific clusters. We explored the loci clusters with 
distinct patterns between cell types (e.g., Pol2 in Cluster 12, Figure]^, TFs from the same families 
(e.g., Bcl3 v.s. Bclafl in Cluster 3, Figure C.13I, and TFs with similar genome-wide enrichment 
(e.g., Max v.s. Usfl in Cluster 2, Figure C.14) using raw data. We further evaluated each cluster 


of genes for their KEGG pathway enrichment (Subramanian et al. 2005), and identified 8 KEGG 


pathways that are significantly enriched in individual clusters (Table [^. Three of our clusters 
(Glusters 7, 9, and 19) have more than half of their genes in one single pathway. Since KEGG 
pathways curate the known knowledge of molecular interaction systems, these clusters may be 
driven by unknown biological processes that warrant further investigation. 

MBASIC infers the clustering structure based on its own estimates of the state-space profiles. 
The ENCODE consortium provides the estimated enrichment regions (i.e., peaks) for each ex¬ 
periment in this study. Then, a natural question is whether MBASIC reveals more information 
compared to clustering of genes based on ENCODE-estimated binary enrichment profiles of TFs. 


'^The normalized signal for unit i and condition k is: 
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Cell: Gm12878 , TF: Pol2 
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In Cluster 12 No • Yes 
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|ikiilog( Control Sample Count + 1 ) Pkiilog( Control Sample Count + 1 ) 


Figure 4: (a, b) Plots of the transformed Pol2 ChIP sample read counts against the transformed 
control sample read counts for all units in (a) Gml2878 and (b) K562 cells. Data from unenriched 
units are expected to reside around the 45 degree dashed line. 


To address this, we created a binary vector for each gene by overlapping its promoter with the 
ENCODE peaks. Then, we applied the state-of-the-art MClust model (Fraley and Raftery 2002) 
to cluster the 10290 promoter regions based on these peak profiles. MClust selected 90 clusters 
based on BIC. Figure [C. 1 5| displays cluster-level estimated enrichment probabilities of TFs across 
the conditions considered. Compared to Figure we can see that many of the MClust clusters 
have very similar enrichment profiles. For example, Clusters 51, 7, 8, 32, 54 contained almost no 
enrichment for any TFs, but are classified as distinct clusters. The association between units across 
these clusters are thus non-trivial to interpret. In addition, we found that for some conditions, the 
enrichment states predicted by MB ASIC are quite different than those from the ENCODE peak 
profiles (e.g.. Figure]^. This is because the ENCODE peaks are identified by whole genome-wide 


Table 2: Significantly enriched KEGG pathways across the 24 clusters. 


KEGG. name 

# Genes 
Overlapped 

Z Score 

Cluster 

Cluster 

Size 

Protein processing in endoplasmic reticulum 

156 

5.652 

7 

391 

Fatty acid elongation in mitochondria 

7 

7.518 

8 

133 

B cell receptor signaling pathway 

74 

6.016 

9 

146 

Lysine biosynthesis 

3 

6.53 

9 

146 

D-Glutamine and D-glutamate metabolism 

3 

5.548 

12 

184 

Vitamin B6 metabolism 

4 

5.28 

14 

156 

Non-homologous end-joining 

12 

7.539 

17 

213 

Lysosome 

116 

5.402 

19 

187 
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analysis and may not reflect the differences between the ChIP and control samples at the local 
promoter regions. MB ASIC attains larger raw data fidelity by directly modeling the counts at 
each unit rather than inheriting results from existing analyses. 
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Figure 5: (a, b) Transformed ChIP versus control sample read counts from a GmI2878-Ctcf 

dataset. Enrichment states are annotated by (a) ENCODE peak profiles and (b) MBASIC estima¬ 
tion. In MBASIC, an observational unit is estimated to be enriched if its enrichment probability 
satisfies P{9ik = 2|y) > 0.5. 


4.2 Genome-wide Identification of -|-9.5-fike Composite Elements 


Johnson et al. (2012) and Gao et al. (2013) described the requirement of the intronic -1-9.5 site, an 


Ebox-GATA composite element located at chr6: 88143884-88157023 in the mouse genome (genome 
version mm9), to establish the hematopoietic stem/progenitor cell (HSC) compartment in the fetal 
liver and for hematopoietic stem cell genesis in the aorta-gonad-mesonephros (ACM), respectively. 

(2012) and Hsu et al.|(p0T^ showed that heterozygous -1-9.5 mutations 


Furthermore, Johnson et al. 


cause a human immunodeficiency associated with myelodysplastic syndrome (MDS) and acute 
myeloid leukemia (AML). Because the -1-9.5 site is the only known cis-element deletion of which 
depletes fetal liver HSCs and is lethal at E13-14 of embryogenesis, identifying additional loci that 
have similar functionality is extremely important for establishing mechanisms that enable GATA 
factor-bound regions with nonredundant activity and have the potential to reveal novel targets for 
therapeutic modulation of hematopoiesis. In this application, we identified 4803 genomic regions 
with the Ebox-GATA motif (CATCTG-N[7-9]-AGATAA where N[7-9] denotes a variable size spacer 
of 7 and 9 nucleotides) in the human genome (genome version hgl9). We considered a 150 bps 
window anchored at each of the 4803 composite elements as the observational unit. To analyze the 
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TF occupancy activities at these units and identify a group of composite elements with occupancy 
profiles similar to that of the +9.5 composite element, we downloaded all ChIP-seq data for the 


Huvec and K562 cells from Gerstein et al. (2012). In total, the data set contained 224 replicates 


spanning K = SA experimental conditions and 77 TFs. 

We used negative binomial distributions with S = 2 states, where s = 1 denoted the unenriched 
(unoccupied) state, in the MBASIC framework. We chose jikii = 1 + Xik, where Xik is the count 
for unit i from the matching control experiment for condition k, to incorporate data from the 
accompanying control experiments of the ChIP samples. For s = 2, we utilized the following 
mixture distribution to account for the heavy tails observed in the raw data: 

^ikl — 2 ^ ^kl2) “b (1 ^kl3^i 

Vik '"^ Bernoulli{vki)■ 

Here, the constant 3 represents the minimum count threshold for enrichment estimation. The 


use of mixture distributions to capture heavy tailed count data was previously considered by Zuo 


and Kele§ (2013). We note that an alternative approach to capture heavy tailed counts would be 


to fit a model using S = 3 states, with s = 2, 3 representing two distinct enrichment components. 
Such an approach would differ from the proposed approach in a subtle yet important way. In this 
alternative approach, allocation of each unit to different enrichment components would affect the 
clustering estimation, while in our approach, clustering is only determined by the enrichment status 
of the individual unit regardless of which enrichment component it follows. The E-M algorithm 
for this setting requires a slight modihcation as discussed in Section [A)2l 

Following the two phase model selection procedure using BIG, we selected the model with 
3 clusters, 2 of which were cell type-homogeneous. The ranges of the estimated distribution 
parameters among replicates within the same condition are displayed in Figures |C.16| and |C.17| 
The three clusters (denoted by Cl, C2, and C3) included 332, 837, 157 composite elements, 
respectively, and the remaining 3477 composite elements were identified as singletons. A heatmap 
for the enrichment probability of each unit under each cell type-TF combination across the three 
clusters is shown in Figure]^ The +9.5 element is a member of cluster C3 which consists of a total 
of 157 +9.5-like composite elements. A detailed genomic annotation of these elements are provided 
in Table C.IO Notably, 46% of the C3 elements reside in intronic regions and 42% of these are 


within first intron. Only 15% of the cluster are located up to 10Kb upstream of transcription start 
sites. 

A detailed analysis of Figure [^reveals that cluster C3 is driven by several transcription factors 
with known associations to GATA2. First, we note that a large fraction of the C3 loci are bound by 


et al. 


BRGl. The chromatin remodeler BRGl is involved in GATAl-mediated chromatin looping (Kim 
2009a|b) and co-localizes with GATAl at some chromatin sites (Hu et al. 2011). BRGl has 


broad functions in many cell types; however, conditional knockouts of BRGl reveal its importance 


in specific cell and tissue contexts (Holley et al. 2014). Another factor that clearly stands out as 


having a GATA2-hke profile in cluster G3 is ETSl. Our prior work identified the propensity of 


2011 

) and 

Dore et al. 

(2012 


occupied GATA motifs to reside near Ets motifs (Linneman et al. 
has highlighted GATA2-ETS co-localization. 

We next performed an alternative naive analysis by utilizing the list of peaks provided by the 
ENGODE project. As in the case of the Transcription Factor Enrichment Network example of 
Section 4.1, these peaks, provided by the ENGODE consortium, were identified by analyzing each 
dataset individually with ENGODE’s uniform ChIP-seq processing pipeline. Figure [C.18| displays 
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Figure 6: Posterior enrichment probability (i.e,. P{0ik = 2|T)) for all units in the three clusters. 
The right most column of the C3 cluster corresjl^nds to the +9.5 element. 
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Figure 7: Proportion of overlap between the top ranked +9.5-like composite elements identified 
by MBASIC and ENCODE peak profiles. The overlap proportion is calculated by considering 
the same number of top ranked units (x-axis) in both the ENCODE-based and MBASIC-based 
similarities to the -1-9.5 site. The dashed lines mark that 78.3% of the C3 units are ranked in the 
top 157 based on the ENCODE peak profiles. 


the ENCODE peak profiles for our cell type-TF conditions. For each of the 4803 composite 
elements, we constructed a peak profile, which is a binary vector indicating whether the element 
overlaps with the ENCODE peaks for each cell type-TF combination. We then computed the 
peak profile based similarity between the -1-9.5 site and each the of the composite elements using 
the R function dist.binary with the ’’Jaccard index” option. For comparison, we computed 
pseudo-binary similarities between each element and the -1-9.5 site using the MBASIC estimated 
enrichment probabilities across all condition^ We then ranked the composite elements based on 
both ENCODE and MBASIC estimated similarities. Figure provides a comparison of the two 
lists as a function of top ranking composite elements. Overall, we observe that the rankings based 
on MBASIC estimation are consistent with the rankings based on the ENCODE peak profiles. 

Although the rankings of the composite elements with respect to their -1-9.5 similarity using 
both the ENCODE peak profiles and MBASIC estimation were quite similar, the two approaches 
resulted in different enrichment estimation at the individual TF-cell combination level. Figure]^ a) 
compares the estimated cluster-level enrichment probabilities of each cell type-TF combination for 
cluster C3 against their average ENCODE peak profiles and highlights the difference between the 
two procedures. To further investigate these differences, we plotted the raw data for individual 
replicates and compared the composite elements that were estimated to be enriched by the two 
methods. An example using data from K562-Chd2 is displayed in Figures |^b) and (c). Although 
many elements have significantly higher counts in the ChIP sample compared to the control sample, 
they are not identified as occupied by Chd2 in K562 according to ENCODE peak annotation. 

^The pseudo-binary similarity between two units ii and 12 is calculated as 3 ( 11 , 12 ) = 

Efc P{0i^k=-i-\Y}P{ei^k=i\Y) 

Y.k P{eiik^i\Y}+p{0i^k^i\Y}-p{0i^k^i\Y}p{ei^k^i\Y} ■ 
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Figure 8: (a) Top half: Enrichment probabilities for the C3 units across all experimental conditions 
estimated by MBASIC. Bottom half: Proportion of C3 units that are overlapped by the ENCODE 
peaks for each condition, (b, c) ChIP sample read counts against normalized control sample 
read counts for one replicate of K562-Chd2 dataset. Enrichment status are annotated by (a) the 
ENCODE peak profiles and (c) MBASIC prediction. 


21 




Another example using a replicate from K562-Yyl is shown in Figure C.19 where several elements 
with zero ChIP count are overlapped by ENCODE peaks. These results indicate that MBASIC 
provides a grouping of the Ebox-GATA composite elements that is more consistent with the raw 
data compared to grouping based on ENCODE peak annotation. 


5 Conclusions and Discussion 

Clustering analysis based on an underlying state-space is a common problem for many genomic and 
epigenomic studies where multiple data sets over many observational units are integrated. In this 
paper, we developed a unified statistical framework, called MBASIC, for addressing these class of 
problems. MBASIC simultaneously projects the observations onto a hidden state-space and infers 
clustered units in this space. The hierarchical structure of MBASIC enables the information of the 
state-space clusters to be fed back into the projection of the raw data, thus reinforces the accuracy 
of predicting the state-space states of individual units. The MBASIC framework offers flexibility in 
a number of aspects of experimental design, such as different numbers of replicates under individual 
experimental conditions and missing values. Additionally, it is applicable to many parametric dis¬ 
tributions. Our computational studies highlighted good operating characteristics of MBASIC and 
the two genomic applications illustrated how large numbers of ChIP-seq datasets can be integrated 
for addressing specific problems. In both of the applications, MBASIC algorithm converged within 
20 minutes for a fixed model on a 64 bit machine with Intel Xeon 3.0GHz processor and 64GB of 
RAM. For model selection, we utilized R package snow to implement the 2-phase procedure with 
parallel fitting of different candidate models using a 8-core 64 bit, 64GB RAM machine with 8 Intel 
Xeon 3.0GHz processors. These runs were completed under 2 hours. The computational efficiency 
of our model depends on the simple, closed-form updates in our E-M algorithm. Such a mathe¬ 
matical form is due, at least in part, to our modeling assumption that the rows of our state-space 
matrix is clustered. We have argued that this assumption, as compared to the PGA-type model 
structures, offers easier interpretation and is well suited for many genomic applications. MBASIG 
is available as R package mbasic at https://github.com/chandlerzuo/mbasic. 
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A Details of the Expectation-Maximization (EM) Algorithms 

A.l Derivation for the E-step 


We derive the expressions for the E-step updates of our algorithm in Eqns. (12), (HU), (HI’ © 
as well as the marginal likelihood in Eqns. (|^ and (10). 

In what follows, we let Oiks denote \{9ik = s}. The joint density of (z, b, 9, Y) is given by: 
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where fiUs = JXih f{yikl\^Ikls,(Ikls,likls)■ The following elementary equality is used repeatedly 
throughout the rest of the derivations in this section. 
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The joint density of {z,b,Y) can be calculated from Eqn. (24): 
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Eqn. (25) can be rewritten as: 
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The joint distribution of {b,Y) can be calculated from Eqn. (26): 
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Then, Eqn. (|27|) can be rewritten as: 
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We can calculate the marginal density of Y, given in Eqn. ([^, from Eqn. (28) as: 
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Eqn. (10) can be obtained similarly. Moreover, we can rewrite (25) as 
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Thus, the density of (z, Y) can be calculated as: 
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Using Eqns. (28) and (29), we obtain Eqn. (12) as 


E[h\Y] = 


cnEi 

—1 fiksPis^ 

1 

(i-OEE^^nEi 

(eL, 

fiks'^jks^ 

1 + CHaLi I 

^X]s=l fiksPis^ 


Similarly, using Eqns. (|3l| and (|29|), we have 
Ehj\Y] = 


^jC nf=i 1 

—1 fiksPis^ 

1 + (1 - ()nj n^i ( 

—1 fiks'^jks^ 

(UtiEi 

— fiksPis H“ (1 

-OEU^iUti 

(Es=1 fiksWjks^ 


Using Eqns. (26) and (27), we have 
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Eqns. (34) and (32) together results in Eqn. (13). Using Eqns. (24) and (25), we have 
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Finally, we obtain Eqn. (151 using Eqns. (351 and (27): 
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A.2 EM Algorithm with Mixture Data Distributions 


An important extension of the MBASIC model is to allow multiple mixture components within 
each state. Eor example, our model in Section 4.2 models the data from state s = 2 as a mixture 
of two negative binomial distributions following the well motivated model of Kuan et al. (2011): 


Yiki — ‘i\9ik = 2 ~ i^ikiNB{^ki2, (^kn) + (1 — k'iki)NB{ij,ki3,crki3), 

Viki BernouUi{vki), 

where the constant 3 denotes the minimum number of reads required to be in state 9 = 2. In this 
section, we describe the general algorithm for such extensions. We assume that data from state s 
has a distribution of rus components: 


Yikl\9iks — 1 ^ '^. Vklsr fsrj'lPklsrj ^klsr: ^iklsr) • 


This can be written in a hierarchical form, using Vikisr as the hidden variable indicating the 
mixture component within the state: 


iyiklsr) l<r<ms MulUnOm{l, {Vklsr)l<r<7nJ, ^ikl\^iks ^j^iklsr 1 fsr {■\P'kl sr; ^klsri ^iklsr'}- 

.... 

Here, we allow the distribution parameters p and cr as well as the prior information derived 7 
to depend on the component. Let fikisr = fsr{yiki\pkisr,crkisr,likisr)- The joint density for this 
model is: 
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Let fiks = nr=i(S^i '^kisr fikisr), then the joint density for z, b, 9, Y can be expressed exactly 
the same as Eqn. (24). Therefore, the M-step updates for W, P, ( and tt are not changed, with 
the related E-step quantities computed as Eqns. (12), (13), ( [T4| , (15). We only need to modify 
the algorithm to estimate variables that depend on the component index r: p, u, and v. 

The related quantities that need to be computed are E[iyikisr\Y] and E[9iksVikisr\Y]. By Eqn. 
([39|, we have 
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where E[9iks\Y] = E[9iks{^ - bi)zij\Y] + E[9iksbi\Y] and can be computed by Eqns. (|36|) 


and p7|. As a result, 
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Given Eqns. (40) and (41), the M-step update for Vkisr is: 
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The M-step updates for fXkisr^ o'kisr can be derived using Eqn. (40). For the negative binomial 
distribution, as in Section 4.2, we have 


^i/sr liklsE 9iks'^iklsr\^^ ^ 
2 = 1 


9 iks^iklsr 


- 1 ) 


i=l 


^(t)2 2 ( 1 

^klsr^ikls 1 ^ 


1 


^(0 
klsr / 


.{t) 


= ELiE (E,fc/-3), 


= ELiE 0,ksVrklsr\^^*-^^ (Jm - ■ 


B Simulation Studies 

This section presents six broad simulation studies to evaluate the performance of MB ASIC. Each 
simulation study had multiple settings as outlined in Table 1 of the main article. We introduced 
the following four families of distributions in our main article: 


Log-normal Distribution. LN{fikisjikis, crus) with a density function: 
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B.l Simulation Study 1 

The first simulation study investigated the performance of MBASIC when the true value of J was 
known and there were no structured clusters. We set the number of observational units as J = 4000 
and the number of clusters as J = 20. The number of conditions was set to K = 30, and within 
each condition the numbers of replicates varied as Uk = 1, 2, 3, each with probability 0.3, 0.5, and 
0.2. The size of the hidden state space was varied at three levels: S = 2,3,4. We simulated data 
under three distributional families: log-normal (LN) ([^, negative binomial (NB) Q, and binomial 
(Bin) (§. We also varied the proportion of singleton units C at 0, 0.1 and 0.4. For simplicity, we 
set ')ikis = 1 in all distributions. 


B.1.1 Parameter Settings 

Parameters Wjks^s and Pis’s generate the hidden state variables Oik’s. We set them as follows. For 
different values of fc, j, and i, the vectors Wjk- = (wjks : 1 < s < 5) and pi. = {pis : 1 < s < S') were 
simulated independently, each following an S-dimensional Dirichlet distribution Dir {a, ■ ■ ■ ,a). We 
chose a uniform concentration parameter of a = 0.2 for all dimensions to ensure that for each 
vector Wjk- or pi., the probability mass tended to concentrate on one component. This controlled 
the conditional variance of {0ik\bi, zij). An increased value of a would increase the conditional 
variance of Oik., thus make it more difficult to recover Wjks’s and Pis’s. 

The settings for parameters Pkis’s and Ckis’s were important. These parameters connected 
hidden states Oik’s to the observed values Yiki’s. In general, recovering hidden states from the 
observed data is more difficult if: (1) differences of the mean values pkis’s between the states are 
small; (2) variances of the distributions within each state are large. To control these two aspects 
at reasonable levels, we set these parameters as follows: 

• For log-normal distributions ([^, we set = 2 -I- log(4s — 3), simulated pkis ^ Af(^s, 0.05^), 
and set akis = 0.5; 

• For negative binomial distributions Q, we set = 8s — 6, simulated pkis ^ N{^s, 0.5^), and 
set (T/c/i = 2.82, aus = 5 for s = 2, 3, 4; 

• For binomial distributions ([^, we simulated pkis Beta{3s,3{S+1 —s)), and 'jikii = ^iki 2 = 
• • • = likis ^ Pois{10). 


Figure B.l displays the histograms of Fi.i,;, 1 < i < / from one of the simulated data sets for 

all the three distributions with S' = 4 components. For comparison, we also present the histogram 
of an actual data set from the analysis in Section 4 of our main article. We observe that the mixture 
distribution of our simulated data with log-normal or negative-binomial distributions closely follow 
the real data. 


B.l.2 Alternative Approaches for Benchmarking MBASIC 

The MBASIC algorithm is summarized as Algorithm 

To the best of our knowledge, there are currently no existing methods suited for the general 
setup of MBASIC. There are, however, algorithms tailored for analyzing specific data types with hi¬ 
erarchical state-space models similar to MBASIC. These algorithms largely fall into two categories. 
In the first category, estimation for the state-space variables are separated from state-space clus¬ 


tering. Some examples include Cheng et al. (2011), Gerstein et al. (2012), Ji et al. (2013), Nephl 
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Figure B.l: Histograms for a) a real data set from a K562 Pol2 replicate in Section 4.1 of our main 
article; and simulated data from one condition based on one simulation for the b) Log-Normal, c) 
Negative Binomial and d) Binomial distribution with S=4 states. 


Algorithm 2 MBASIC 

for t = 1, 2, • • • until convergence do 

Expectation-Step: Compute the conditional expectations E[I{9ik = E\bi\^^'^~^'>], 

E[I{0,k = E[z,j(l - E[mu = 

Maximization-Step: Update estimates for parameters pkis, o'kis, Ci Wjks, Pis- 

end for 













Table B.l: Simulation Study 1. A summary of the benchmark algorithms that are compared to 
MBASIC. Neither SE-MC nor PE-MC perform joint estimation of the model parameters. SE-* 
algorithms estimate the data-specific model parameters and state-space as a first step and then 
cluster the state variables. PE-* algorithms estimate data-specific model parameters and fixes 
these in joint estimation of the state-space and clustering. * Denotes distributional parameters for 
each experimental replicate. 


Algorithm 

Is State-space 
Estimation Joint 
with Clustering? 

Is Parameter* 
Estimation Joint 
with Clustering? 

Clustering 

model 

Include 

singletons 

MBASIC 

Joint 

Joint 

Mixture model 

Yes 

SE-HC 

Separate 

Separate 

Hierarchical clustering 

No 

SE-MC 

Separate 

Separate 

Mixture model 

Yes 

PE-MC 

Joint 

Separate 

Mixture model 

Yes 

MBASICO 

Joint 

Joint 

Mixture model 

No 

SE-MCO 

Separate 

Separate 

Mixture model 

No 

PE-MCO 

Joint 

Separate 

Mixture model 

No 


et al. (20121. In the second category, distributional parameters for each experimental replicate are 


estimated first. These parameters are then fixed, and the state-space variables and the clustering 
structure are estimated jointly conditional on these estimates. Examples of this approach include 
Wei et al. (20121, Zeng et al. (20131, and Wei et al. ( 2015[ ). 

To compare the general implementation of MBASIC as in Algorithmj^with these existing model 
fitting ideas, we designed six benchmark algorithms. Table |B.1| provides a summary of these 
algorithms. Two of these algorithms, SE-HC (State-space Estimation followed by Hierarchical 
Clustering) and SE-MC (State-space Estimation followed by Mixture model Clustering), treat the 
state-space mapping step and the state-space clustering separately. The third algorithm, PE- 
MC (Parameter Estimation followed by Mixture model Clustering), separates experiment-specific 
distributional parameter estimation from the joint estimation of other parameters. 

For all the three algorithms, in the first step, observations from each experimental condition 
{Yiki :!<*</, 1 <Z< Tifc} are fitted according to the following model: 


{Yikl\0ik — ^ fs{‘\f^kls: ^kls: ^ikls)-, P{^ik — 5) — Qks- 


(46) 


The standard E-M algorithm can be used for the first step and results in estimates of qks, f^kis, o'us 
as well as the posterior estimates for the state space P{9ik = s|E). In the second step, SE-MC and 
SE-HC cluster the observational units based on the estimated P{0ik = s|E) from the first step. 
SE-HC (Algorithm]^ uses hierarchical clustering, while SE-MC (Algorithm]^ uses MBASIC with 
degenerate distributions (|^ for clustering. The second step of PE-MC (Algorithm is similar to 
Algorithm]^ except that parameters /ifczs, o'kis’s are not updated. 

In addition to joint fitting of all model parameters, another important feature of MBASIC is 
its inclusion of the singleton cluster ‘^o- To the best of our knowledge, this feature is not included 


in similar models such as Wei et al. (2012) and Wei et al. (2015). We conjecture that in practice. 


when some units can not be grouped together with other units due to their distinct state-space 
profiles, including this singleton cluster can enhance model estimation. To test this conjecture, 
we developed a version of each of the SE-MC, PE-MC, and MBASIC algorithms that ignore the 
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singleton cluster, i.e., forces each unit into a cluster. This is achieved simply by initializing C = 0 
in the Algorithms and[^ We refer to these algorithms by SE-MCO, PE-MCO, and MBASICO. 


Algorithm 3 State-space estimation followed by hierarchical clustering (SE-HC) 

Step 1: 

for 1 <k < K do 

Apply the standard E-M algorithm on data {Yiki : 1 < i < /, 1 < ? < Ufc} to estimate posterior 
probabilities P{9ik = s|F). 

end for 

Step 2: 

Let 6i = {P{9ik = s|E)}i<fe<^ i<s<s. Cluster vectors 9i into J clusters using hierarchical 
clustering algorithm with the Euclidean distance. Estimate Wjks as the means within each 
cluster. 


Algorithm 4 State-space estimation followed by mixture model clustering (SE-MC) 

Step 1: 

for 1 <k < K do 

Apply the standard E-M algorithm on data {Yiki : 1 < i < /, 1 < Z < nfc} to estimate posterior 
probabilities P(9ik = s|E). 

end for 

Step 2: 

Denote 0*^, = arggmaxP(0ife = sjE) for each 1 < fc < A, 1 < i < /). Apply Algorithmwith 
9ik t— 9*f. and fs = I{y = s) to obtain estimates for Wjks, Pis, C, and nj. 


Algorithm 5 Parameter estimation followed by mixture model clustering (PE-MC) 

Step 1: 

for I < k < K do 

Apply the standard E-M algorithm on data 1<Z< rife} to estimate pkis, 

tTfeis for each experiment. 

end for 

Step 2: 

Apply Algorithm without updating pkis, <^kis in the Maximization step. 


B.1.3 Results 

We utilized several criteria to compare the performance of MBASIC to the benchmark algorithms 
in Table [BT| To estimate how well the state space was characterized for each cluster, we computed 
the mean-squared error for W (MSE-W) as MSE-W=y^^j- s{wjks — WjksY/[JKS) . We also 
evaluated how well each method recovered the true state variables 9ik ’s. This was reflected by the 
state prediction error (SPE) as the mean squared error between the simulated states 9ik ’s and their 
posterior probabilities: SPE=^^. ^ j,[l{0ife = s} — P{9ik = s\Y)Y/{IKS). Finally, to compare 
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Figure B.2: Simulation Study 1, log-normal distribution. Boxplots for ARI, MSE-W, and SPE 
across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of 
singletons at 0, 0.1, 0.4. Table [BT] summarizes the methods compared. 
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Figure B.3: Simulation Study 1, negative binomial distribution. Boxplots for ARI, MSE-W, and 
SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion 
of singletons at 0, 0.1, 0.4. Table [BT] summarizes the methods compared. 
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Figure B.4: Simulation study 1, binomial distribution. Boxplots for ARI, MSE-W, and SPE across 
10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of 
singletons at 0, 0.1, 0.4. Table [BT] summarizes the methods compared. 
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the estimated clustering with the simulated true clustering, we computed the Adjusted Rand 


Index (ARI) (Rand 1971). ARI is a measure for the similarity between two different clusterings 
of the data. Its value ranges between -1 and 1, with 1 indicating perfect match between the two 
clusterings. 

ARI requires the true clusters denoted by 0 < j < J and their estimates denoted by 
0 < j < J. In our simulations, they were computed as: 

% = {I < i < I : h = 1}; = {1 < i < I : h = 0, Zi, = 1}, j < 1, 

where 'i#o denoted the set of singleton units. was computed from the posterior distributions 
as = {1 < f < / : j = argmaxo<j<jP(i S ^j\Y)}, where P{i G ^o|^) = E{bi\Y), and 
PiiGY’,\Y) = E[il-h)z.,\Y]. 

The simulation results under various settings are summarized by the boxplots for each criterion 
in Figures p3.2[ |B.3[ and |B.4| Across all different simulation settings, the performance of MB ASIC 
was consistently among the best in all of the ARI, MSE-W, and SPE metrics. This shows that 
MBASIC could not only recover the clustering structure, but also achieve high accuracy in esti¬ 
mating individual states. SE-HC, SE-MC and SE-MCO performed the worst in both detecting the 
clustering structure and estimating the individual states. This suggests that separating state-space 
inference from joint model fitting can significantly deteriorate model estimation. Different from 
the SE-* methods, performances of PE-MC and PE-MCO were much closer to MBASIC. For the 


negative binomial and binomial distributions (Figures B.3 and B.4|, PE-MC achieved similar ARI 
levels to MBASIC and slightly larger SPE values. These observations show that by jointly esti¬ 
mating the clusters and the states, data under different conditions could borrow information from 
each other and thus substantially improve the state-space estimation. Overall, these observations 


Wei et al.| ( 

2015 

) and 

Wei et al. 

(2012 


The simulation results also highlight the effect of modeling the singleton cluster in various 
settings. Comparing the performances of MBASIC with MBASICO and PE-MC with PE-MCO, we 
see that modeling the singleton cluster does not have a significant effect when the proportion of 
singletons is low, i.e., ^ = 0 or 0.1; however, the improvement is highly significant when (j = 0.4. 
When C = 0.4, including singletons significantly improved the performance with respect to ARI, 
but did not have an obvious effect on SPE. This has several implications in practice. First, the 
fact that MBASIC does not under-perform any other methods when ( = 0 or 0.1 indicates that 
increasing the model complexity by introducing singletons does not lead to unrobust inference. 
Because we are always agnostic on the existence of singletons for any real data, keeping them in 
our model would guard against their adverse influence in inferring the clustering structure. Second, 
although incorporating the singleton cluster does not improve estimating individual states, some 
epigenetic studies focus primarily on the association structure between units, as our example in 
4.2. For such studies, the gain in estimating the clustering structure by including the singletons 
is essential. We note that in the comparison of SE-MCO with SE-MC for the negative binomial 


and the binomial distributions (Figures B.3 and B.4|, modeling the singletons does not necessarily 


improve estimation for separate model fitting even when the proportion of singletons is high, e.g., 
C = 0.4. This might suggest that the state-space estimation step is introducing additional noise to 
the clustering step, which in turn makes it less favorable to infer a complicated clustering structure 
with singletons. 
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Table B.2: Simulation study 2, Scenario 1, unstructured clusters. Simulation results for model 
selection without structural constraints. For each criterion, the mean is computed over 10 simulated 
data sets, with the standard deviation shown in the parentheses. 


Dist. 

c 

J 

ARI 

MSE-W 

SPE 

Bin 

0.1 

20.8 ( 2.098 ) 

0.94 ( 0.036 ) 

0.096 ( 0.018 ) 

0.159 ( 0.014 ) 

Bin 

0.4 

20.9 ( 1.101 ) 

0.914 ( 0.035 ) 

0.122 ( 0.034 ) 

0.204 ( 0.012 ) 

LN 

0.1 

20.7 ( 0.823 ) 

0.989 ( 0.005 ) 

0.044 ( 0.03 ) 

0.086 ( 0.006 ) 

LN 

0.4 

21.3 ( 1.337 ) 

0.972 ( 0.007 ) 

0.095 ( 0.027 ) 

0.107 ( 0.008 ) 

NB 

0.1 

21.6 ( 0.843 ) 

0.947 ( 0.021 ) 

0.089 ( 0.028 ) 

0.154 ( 0.007 ) 

NB 

0.4 

20.6 ( 2.271 ) 

0.902 ( 0.026 ) 

0.112 ( 0.048 ) 

0.189 ( 0.007 ) 


Table B.3: Simulation study 2, Scenario 2, structured clusters. Simulation results for model 
selection with structural constraints. For each criterion, the mean is computed over 10 simulated 
data sets, with the standard deviation shown in the parentheses. 


Dist. 

c 


J 

ARI 

MSE-W 

SPE 

Bin 

0.1 

10.3 ( 1.16 ) 

20.7 ( 1.494 ) 

0.934 ( 0.022 ) 

0.084 ( 0.035 ) 

0.162 ( 0.02 ) 

Bin 

0.4 

10.3 ( 1.636 ) 

21 ( 2.625 ) 

0.897 ( 0.048 ) 

0.125 ( 0.03 ) 

0.196 ( 0.031 ) 

LN 

0.1 

10.4 ( 0.516 ) 

20.6 ( 0.516 ) 

0.984 ( 0.015 ) 

0.044 ( 0.032 ) 

0.086 ( 0.006 ) 

LN 

0.4 

11.2 ( 1.619 ) 

22.5 ( 1.509 ) 

0.968 ( 0.01 ) 

0.108 ( 0.037 ) 

0.106 ( 0.006 ) 

NB 

0.1 

10.9 ( 1.197 ) 

21 ( 1.054 ) 

0.955 ( 0.019 ) 

0.064 ( 0.035 ) 

0.155 ( 0.008 ) 

NB 

0.4 

11.2 ( 1.814 ) 

22.2 ( 1.398 ) 

0.926 ( 0.014 ) 

0.108 ( 0.031 ) 

0.184 ( 0.013 ) 


B.2 Simulation Study 2: Model Selection 


This second set of simulations aimed to evaluate the use of BIG to select the number of clusters 
as well as the structural constraints for each cluster. We simulated data sets under two scenarios. 
For the first scenario, each data set had J = 20 clusters with K = 30 experimental conditions, 
and none of the clusters had structural constraints. For the second scenario, each data set had 
J — 20 clusters over K = 30 conditions, but Ji = 10 of the clusters were structurally constrained 
as follows: 

'CJj^k.s — '^j,k+K/2,st^jj ^ — j S: T/2,VA:, 1 ^ A: ^ Kl2. 

We refer the two scenarios as the unstructured scenario and the structured scenario, respectively. 
We considered log-normal distributions (§, negative binomial distributions 0 and binomial dis¬ 
tributions 0 for both cases. We also varied the proportion of singleton units C at 0.1 and 0.4. 
The number of states was fixed at S' = 2. The remaining parameters were simulated following the 
same mechanism as in Section lB.l.ll 

For each simulated data set, we htted a number of candidate models. For the unstructured 
scenario, we varied the number of clusters J from 10 to 30. For the structured scenario, we followed 
the two-phase procedure described in Section 3.4 of our main article. The best model was selected 
by the minimum BIG value. To assess the performances of thes e selected models, we computed 
the ARI, MSE-W and SPE metrics as described in Section B. 


The simulation results are summarized in Tables B.2 and B.3 Under each set of parameters. 


we computed the mean and the standard deviation for each of the criterion as well as the selected 


®When the actual J and its estimate j are different, MSE-W is redefined as: 
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value of J and Ji under 10 simulated data sets. These tables show that the selected values for J and 
Ji were very close to the true values. Moreover, MBASIC performed uniformly well with respect 
to ARI, MSE-W, and SPE under different settings. These results indicate that even if MBASIC 
may not identify the “true” structure that drives the actual data, the identified structures can still 
properly represent the state-space associations between units. 


B.3 Simulation Studies 3-5: Comparison with iASeq and CorMotif 


In this section, we compare MBASIC with two recently proposed models for integrative analysis 
of specific types of genomic data: CorMotif (Wei et al. 2015) and iASeq (Wei et al. |2012 |. Both 
models have the similar state-space clustering structure as MBASIC. The main difference from 
MBASIC is that they each incorporate more complicated distributional assumptions targeting 
specific genomic data types. The CorMotif model specifically addresses integrative differential 
expression analysis with Uki case condition replicates and control condition replicates for each 
experimental condition k. It inherits the LIMMA (Smyth 2004) framework for differential analysis 
of gene-expression data and assumes mixture of Gaussian distributions with S = 2 states: s = 1 
for the equally expressed state, and s = 2 for the differentially expressed state. Specifically, the 
CorMotif model has the following state-space mapping structure: 


riksl 2 

_2 ^^rik ’ 

^ik 

k'iklcf'lk 

{^ikl \ ^ik — 1 ) ^ 7 ^ ik) '> ^ 5 ^kl j 

{^ikl\^ik — 2 ) T k‘iki^ik)-i ^ ~ ■ ■ * , 77 .^ 1 , 

^ikl {k'ikO: ^ ik') •! ^ — ^5 2, * * * , TifcO: 


where Xn-i's are the observed data from control experiments, and Yiki are the observed data from 
the case experiments. Uk and sf. are hyper parameters specific to each experiment to account 
for potential heterogeneity among units within the same state, and Uk reflects the strength of 
differential expression. CorMotif assumes almost the same state-space clustering structure as 
MBASIC except that it does not include singletons. The iASeq model, targeting at allele-specific 
binding problems has the following state-space mapping structure: 

Yiki -^Binom{'yiki,Pik), 

Pikl^ik — 2 Pk\ 

Pik\d^k = 1 '^Unif (^0, —, 

V ctk + Pk J 

Pik\6ik = 3 nif (—, 1^ , 

\ak + Pk J 


where 


MSE -W = 


" KS{J+J) ^ 


Ci{j) = arg^,<j , C2(j) = argj./<j 
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where the afe, j5k are experiment-specific parameters, and ^iki is the observed total number of reads 
between two alleles. The state-space mapping structure for iASeq is almost the same as MBASIC, 
except that it assumes no singletons, and that one cluster is dedicated to equal binding/occupancy 
between the alleles (i.e., wiki = 1, VI < fc < K). 

There are two key differences between CorMotif/iASeq and MBASIC. First, both CorMotif and 
iASeq address the heterogeneity among the units within the same state, and they introduce addi¬ 
tional hyper parameters to model the heterogeneous parameters associated with the distribution of 
individual units. Compared to MBASIC, where we assume the distributions within the same state 
are homogeneous, such heterogeneous distributional assumptions are much more realistic. Second, 
CorMotif and iASeq implement two-stage estimation procedures similar to PE-MCO, which sep¬ 
arate parameter estimation from state-space clustering. Wei et al. (2015) pointed out that once 
we have the heterogeneous distributional parameters within each state, joint model fitting for all 
parameters would require running a Markov Chain Monte-Carlo algorithm rather than the simple 
E-M algorithm we have developed for MBASIC. Therefore, the computational cost ensued might 
render its applicability for large real data sets. 

In comparison of MBASIC to CorMotif and iASeq, we simulated data according to each of the 
assumed distributions of CorMotif/iASeq, but fitted MBASIC models using simplified distribu¬ 
tions. For data simulated from the iASeq model, we used MBASIC to fit binomial distributions 
with S' = 3 states ([^. For data simulated from the CorMotif model, we first generated two 
versions of t-statistics as follows. For each unit and experiment, denote Yik = 

Xik = YAl\XM/nk 0 , Sjfc = Er=i(^*fci “ '^ikY + J2?=ii^tki “ ^»fc)^]/(nfci + ^^fco “ 2) and 
Vk = 1/u/ci + l/n/cQ. We computed the naive t-statistic Tik as: 


Tik = 


Y,k - X, 
y/Yk^k 


(47) 


We also computed the limma t-statistic Tik by first fitting the data for each condition using 


LIMMA (Smyth 2004) to estimate Uk and s/, then computed: 


Tik = 


'^klk -)- nkl -)- klkO ^{Yik Y ik) 

\/ufe[(nfei -I- Uko - 2)sl Uksl] 


(48) 


For each set of Ti^’s and Tik’s, we fitted the MBASIC model with S = 2 components of scaled-t 
distributions: 


T/ fMks\(^iks — 1 ~ to 


s = l, 2. 


(49) 


Here, fiks is the scaling parameter, and aks is the degrees of freedom. Because we pooled the 
replicate level data to generate these t-statistics, the parameters /i and a no longer depended on 1. 
We refer to the method using Tik as MBASIC-limma, and using Tik as MBASIC-t. Because there 
is no closed form maximum likelihood solution for t-distributions, we use the moment method to 
estimate /r^g’s and CTfcg’s in the M-step similar to the case of negative binomial distributions. 

In Simulation Study 3, we simulated data following the iASeq model. We set ak = h = 2, 
and simulated state-space variables the same as in Section [B . 1 1 with I = 4000. We set J = 10, 20 
and C = 0, 0.1, 0.4. Simulation Studies 4-5 compare MBASIC with CorMotif. In Simulation 
Study 4, we simulated data in four settings corresponding to Simulations 1- 4 of Wei et al. (2015) 
respectively. In these settings, we had Uk = Uk = 4, = 0.02. Table B.4 summarizes the 


settings for the number of clusters, experiment conditions, and units for the state-space variables. 
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Table B.4: Summary for the designs of the simulation settings in Simulation Study 4, originally 
designed by Wei et al. (20151. 


Simulation Setting / J K 

1 10,000 4 T 

2 10,000 4 4 

3 10,000 5 8 

4 10,000 5 20 


Table B.5: Simulation Studies 3-5. A summary of the simulation designs, the htting algorithms 
compared, and the figure numbers for the results. 


Study 

J 

C 

True model 

Fitting algorithms 

Related hgures 

3 

10 , 20 

0, 0.1, 0.4 

iASeq 

MBASIC, iASeq 

Figure 

B.5 


4 

4, 5 

0 

CorMotif 

MBASIC-limma, MBASIC-t, CorMotif 

Figures 

B.6 

B.8 

5 

10 , 20 

0, 0.1, 0.4 

CorMotif 

MBASIC-limma, MBASIC-t, CorMotif 

Figure 

B.V 



We refer our readers to Wei et al. (2015[) for more details of the state-space design. We note that 


Wei et al. (2015) simulations did not include singletons (i.e., C = 0) and furthermore, their settings 
assumed Wjks G {0, 1}. This means that the state-space variables are completely determined by 
the clustering structure. In Simulation Study 5, we set and the same as in Simulation Study 
4, but varied Uk as Uk = 8 for easier distinction between different states. However, we simulated 
Wjks following S-dimensional Dirichlet distributions as in Simulation Study 1 to introduce noises 
in generating state-space variables. In addition, we simulated data with smaller number of units 
(/ = 4000), but more clusters {J = 10, 20), and varied the proportion of singletons C = 0; 0.1, 0.4. 
The other details of generating state-space variables were the same as in Section [BT^ 

Table |B.5| further summarizes the components of the Simulation Studies. For each set of 
parameters, we simulated 10 data sets. We computed ARI, MSE-W, and SPE based on both the 
model with the number of clusters selected by BIC, and the oracle model where the number of 
clusters is set to its true value. The comparison between MBASIC and iASeq is shown in Figure 
|B.5| For all the different settings, MBASIC achieved better clustering performance, with higher 
ARI values. However, iASeq performed better in SPE and MSE-W. When C = 0, iASeq performed 
overall better than MBASIC, with similar ARI values as MBASIC but much lower SPE. However, 
as C increased, iASeq’s ARI value became significantly smaller than MBASIC, while its SPE value 
became closer to MBASIC’s. In such cases, the benefits of modeling singletons seem to outweigh 
the loss of using simplified distributional assumptions. 

The comparison between MBASIC and CorMotif is summarized in Figures |B.6| and |B.7| In 


Simulation Study 4 (Figure B.6), because CorMotif models did not allow singletons, we also ex¬ 
cluded the singleton cluster in htting MBASIC models. MBASIC-limma performed the best except 
in the Hrst setting, where CorMotif achieved the best SPE. Figure |B.8| depicts the average true 
positive rate in detecting states with 9ik = 2 among the 1000 top ranking units for each of the 
four settings. In all but Setting 1, MBASIC-limma performed equally well as CorMotif. We note 
that Setting 1 has the fewest clusters J = 4 and the fewest experimental conditions AT = 4, while 
the other settings have more complicated state-space structures. Performance of MBASIC-t was 
the worst in all the four settings. This suggests that neglecting the heterogeneity in these cases 
can signihcantly increase estimation error. Although MBASIC model alone does not address the 
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Figure B.5: Simulation Study 3, comparison between MBASIC and iASeq. We varied the number 
of clusters at 10, 20 and the proportion of singletons at 0, 0.1 and 0.4. Results are summarized 
over 10 simulations under each setting. 
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Figure B.6: Simulation Study comparison between MBASIC and CorMotif. We simulated data 
under four settings as in Table [B^ Results are summarized over 10 simulations under each setting. 
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Figure B.7: Simulation Study 5, comparison between MBASIC and CorMotif. We varied the 
number of clusters at 10, 20 and the proportion of singletons at 0, 0.1 and 0.4. Results are 
summarized over 10 simulations under each setting. 
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Figure B.8: Simulation Study 4> comparison between MBASIC and CorMotif. The average true 
positive rate for the 10 simulations among the 1000 highest ranking unit-experiment pairs for each 
of the four simulation settings in Table |B.4[ For each simulation, the “true positive” set consists 
of (*, kfs with 9ik = 2, and the ranking is based on the posterior probability P{9ik = 2|F). 
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(b) Simulations 4-6 


(a) Simulations 1-3 



Figure B.9: Simulation Study 6. Two settings of the true cluster patterns, represented by the 
matrix K/c2)?<j<8, i<fe< 30 - 


heterogeneity issue, fitting MBASIC models after a data pre-processing step that incorporates 
the heterogeneity structure, such as computing in MBASIC-limma, can significantly improve 
model inference. In Simulation Study 5 where we had stronger signals in separating distribution 
components but noisy state-space clusters, CorMotif resulted in the largest SPE values in all set¬ 
tings (Figure [B .71 ). Although its performance in ARI was comparable with MBASIC-limma when 
C < 0.1, it deteriorated with increasing proportion of singletons, i.e., C = 0.4. Simulation Studies 
4 and 5 collectively suggest that MBASIC’s performance is competitive with CorMotif in settings 
where we have less noise in clustering structure, small numbers of clusters, and some level of single- 
tons despite the fact that the distributional assumptions of MBASIC might be mis-specified. This 
indicates that for real data sets where we are agnostic about the true data generating structure, 
MBASIC might be a more general and robust approach. 


B.4 Simulation Study 6: Weak Clusters 


Wei et al. (20151 pointed out that state-space clustering without accommodating singletons can 
lead to merging of small clusters with large clusters with distinct state-space profiles. Such a 
phenomenon may alter the interpretation of the W matrix, because each column may represent 
the average state-space pattern of several small clusters that lack the data support. It is therefore 
important to investigate whether such a phenomenon still exists for MBASIC where we include a 
singleton cluster. 

We conducted six simulations in Simulation Study 6 to investigate this issue. We simulated 
data according to the log-normal distribution with S = 2 states, J = 8 clusters, and K = 30 
conditions. The number of replicates within each condition, as well as the distribution parameters 
within each state is the same as in Section B.1.1 We set the sizes of the first two clusters as 2000, 


and varied the size of each of the other six clusters, that is Usmaii, as 20 or 100. To vary the level 
of state-space similarity between the two big clusters and the small clusters, we had two settings 
for the state-space pattern as shown in Figure [B?^ For Simulations 1-3, the conditions in which 
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the small clusters have state s = 2 are distinct from the two big clusters, while for Simulations 
4-6, the patterns between the small and large clusters are more similar. To control these cluster 
patterns, we set wjks S {0.1, 0.9}. Finally, we included n-singieton = 0 or 2000 singletons in each 
simulated data set. The states for the singleton units were generated the same as in Section [B.l.l| 
In each simulation, we fitted the data using MBASIC and MBASICO with BIC to select the 
number of clusters. MBASICO differs from MBASIC only by the exclusion of the singleton feature. 
Therefore, comparing these two methods allow us to assess how fitting a singleton cluster may affect 
the small cluster merging problem. Tables! 


i6| and B.7 compare the confusion matrices between 
the fitted and the true clusters. We also display the state-space patterns of the fitted models in 
Figures p3.10| and |B.11| In Simulations I and 4, with rismaU = 20 units in each of the small clusters, 
MBASIC classified the units of small clusters as singletons, while MBASICO merged them to form 
a spurious cluster. The state-space pattern estimated by MBASIC represented the two real big 
clusters. When the data included singletons, as in Simulations 2 and 5, MBASICO formed more 
spurious clusters, while MBASIC continued to allocate the small clusters as singletons. When we 
had risuiaii = 100 units in each of the small clusters, both methods identified these small cluster 
patterns in Simulation 6 (Figure [B.1 I [ ), but formed spurious clusters in Simulation 3 (Figure B.lOl. 
We compare the resulting ARI, MSE-W, and SPE between MBASIC and MBASICO in Table 
B.8 Performances of these two methods are close when we have no singletons, but differentiate 


otherwise. Based on these simulations, we conclude that fitting a singleton cluster can substantially 
avoid merging weak clusters. The state-space patterns estimated by the W matrix are more likely 
to reflect true underlying clusters rather than the average of several small clusters. We acknowledge 
that how well modeling the singletons can avoid merging weak clusters requires further investigation 
in more dynamic settings as we vary the similarity among clusters, the difference among the states, 
as well as other variables that influence cluster structures such as J, K, S. We leave such potential 
investigations as future research. 
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(a) MBASIC 


(b) MBASICO 



(c) MBASIC 


(d) MBASICO 


^small — 20 
^singleton — 
2000 



(e) MBASIC 


(f) MBASICO 



Figure B.IO: Simulation Study 6, Simulations 1-3. Estimated cluster patterns by (a, c, e) MBASIC 
and (b, d, f) MBASICO. The true clustering pattern is shown in Figure [B^ a). 
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(a) MBASIC 


(b) MBASICO 



(c) MBASIC 


(d) MBASICO 


'^small — 20 
^singleton — 
2000 



1985 1974 

Cluster Size 


w 


1 0.75 
0.50 
0.25 



2018 393 457 574 668 2010 


Cluster Size 


(e) MBASIC 


(f) MBASICO 


'^small — 100 
^singleton — 
0 


w 


1 0.75 
0.50 
0.25 



1991 107 97 98 2099 99 101 

Cluster Size 



Figure B.ll: Simulation Study 6^ Simulations 4-6. Estimated cluster patterns by (a, c, e) MBASIC 
and (b, d, f) MBASICO. The true clustering pattern is shown in Figure [B^b). 
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Table B.6: Simulation Study 6, Simulations 1-3. Confusion matrix between the true clusters and 
the estimated clusters. The true cluster pattern is shown in Figure [B^ a). 

Simulation 1, — 20, ngi^gieton — 0 


True 

0 

MBASIC 

1 2 

True 

MBASICO 

1 2 3 

1 

23 

2 

1975 

1 

1998 

2 

0 

2 

30 

1967 

3 

2 

4 

1995 

1 

3 

20 

0 

0 

3 

0 

1 

19 

4 

20 

0 

0 

4 

0 

1 

19 

5 

20 

0 

0 

5 

0 

0 

20 

6 

20 

0 

0 

6 

0 

1 

19 

7 

20 

0 

0 

7 

0 

0 

20 

8 

20 

0 

0 

8 

0 

1 

19 


Simulation 2, Ugniaii — 20, rigingieton — 2000 


True 

0 

MBASIC 

1 2 

3 

True 

1 

2 

MBASICO 

3 4 

5 

6 

0 

1957 

19 

17 

7 

0 

67 

372 

620 

59 

427 

455 

1 

180 

1818 

2 

0 

1 

1957 

25 

0 

4 

0 

14 

2 

153 

3 

1844 

0 

2 

8 

23 

3 

1950 

0 

16 

3 

20 

0 

0 

0 

3 

0 

0 

1 

0 

0 

19 

4 

20 

0 

0 

0 

4 

0 

0 

0 

0 

0 

20 

5 

20 

0 

0 

0 

5 

0 

0 

1 

0 

0 

19 

6 

20 

0 

0 

0 

6 

0 

0 

1 

0 

0 

19 

7 

20 

0 

0 

0 

7 

0 

0 

1 

0 

0 

19 

8 

20 

0 

0 

0 

8 

0 

0 

1 

0 

0 

19 


Simulation 3, n-gj^aH — 100, rigi^gieton — 0 


True 

0 

1 

MBASIC 

2 3 

4 

5 

True 

1 

MBASICO 

2 3 

4 

1 

21 

1974 

0 

0 

5 

0 

1 

1993 

0 

7 

0 

2 

12 

7 

0 

1 

1980 

0 

2 

7 

I 

1991 

1 

3 

1 

1 

9 

0 

0 

89 

3 

1 

0 

0 

99 

4 

5 

0 

89 

0 

0 

6 

4 

0 

2 

0 

98 

5 

6 

0 

2 

92 

0 

0 

5 

0 

99 

0 

1 

6 

1 

0 

0 

13 

0 

86 

6 

0 

6 

0 

94 

7 

0 

0 

96 

4 

0 

0 

7 

0 

100 

0 

0 

8 

0 

0 

0 

97 

0 

3 

8 

0 

99 

0 

1 
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Table B.7: Simulation Study 6, Simulations Jf.-6. Confusion matrix between the true clusters and 
the estimated clusters. The true cluster pattern is shown in Figure [B^ b). 

Simulation 4, ngmaii = 20, nsi„gieton = 0 


True 

0 

MB ASIC 

1 2 

True 

MBASICO 

1 2 3 

1 

5 

1995 

0 

1 

1999 

1 

0 

2 

1 

0 

1999 

2 

0 

0 

2000 

3 

1 

19 

0 

3 

17 

3 

0 

4 

19 

1 

0 

4 

0 

20 

0 

5 

1 

0 

19 

5 

0 

5 

15 

6 

20 

0 

0 

6 

0 

20 

0 

7 

18 

1 

1 

7 

0 

19 

1 

8 

20 

0 

0 

8 

1 

19 

0 


Simulation 5, — 20, rigingieton — 2000 


True 

0 

MB ASIC 
I 

2 

True 

1 

2 

MBASICO 

3 4 

5 

6 

0 

1986 

7 

7 

0 

16 

393 

457 

561 

560 

13 

1 

31 

1969 

0 

1 

1990 

0 

0 

5 

5 

0 

2 

45 

0 

1955 

2 

0 

0 

0 

5 

12 

1983 

3 

11 

9 

0 

3 

12 

0 

0 

1 

7 

0 

4 

20 

0 

0 

4 

0 

0 

0 

1 

19 

0 

5 

8 

0 

12 

5 

0 

0 

0 

0 

6 

14 

6 

20 

0 

0 

6 

0 

0 

0 

0 

20 

0 

7 

20 

0 

0 

7 

0 

0 

0 

0 

20 

0 

8 

20 

0 

0 

8 

0 

0 

0 

1 

19 

0 


Simulation 6, ngmaii = 100, rigingieton = 0 


True 

0 

1 

2 

MBASIC 

3 4 5 

6 

7 

True 

1 

2 

MBASICO 

3 4 5 

6 

7 

1 

3 

1983 

14 

0 

0 

0 

0 

0 

1 

1976 

24 

0 

0 

0 

0 

0 

2 

1 

0 

0 

0 

0 

1999 

0 

0 

2 

0 

0 

0 

0 

2000 

0 

0 

3 

0 

8 

92 

0 

0 

0 

0 

0 

3 

7 

93 

0 

0 

0 

0 

0 

4 

1 

0 

0 

1 

98 

0 

0 

0 

4 

0 

0 

99 

0 

0 

1 

0 

5 

1 

0 

0 

0 

0 

99 

0 

0 

5 

0 

0 

0 

0 

99 

0 

1 

6 

1 

0 

0 

0 

0 

0 

99 

0 

6 

0 

0 

1 

0 

0 

0 

99 

7 

1 

0 

1 

96 

0 

1 

0 

1 

7 

0 

2 

1 

1 

1 

95 

0 

8 

0 

0 

0 

0 

0 

0 

0 

100 

8 

0 

0 

0 

100 

0 

0 

0 
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Table B.8: Simulation Study 6. ARI, MSE-W, and SPE in all simulations. 


Simulation 

^small 

^singleton 

ARI 

MBASIC 

MSE-W 

SPE 

ARI 

MBASICO 

MSE-W 

SPE 

1 

20 

0 

0.967 

0.433 

0.185 

0.991 

0.29 

0.189 

2 

20 

2000 

0.79 

0.442 

0.192 

0.727 

0.34 

0.193 

3 

100 

0 

0.969 

0.203 

0.187 

0.975 

0.233 

0.193 

4 

20 

0 

0.977 

0.384 

0.169 

0.983 

0.260 

0.167 

5 

20 

2000 

0.926 

0.384 

0.185 

0.773 

0.333 

0.184 

6 

100 

0 

0.949 

0.087 

0.172 

0.947 

0.087 

0.171 
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Condition 



Condition 



Condition 


Figure C.12: The range of the estimated pararrfiSers iikis and akis among the different replicates 
under the same experimental condition for the transcription factor enrichment network data in 
Section 4.1. 





(a) 


(b) 


Cell: Gm12878 , TF: Bcl3 


Cell: Gm12878 , TF: Bclaf1m33 


In Cluster 3 No • Yes 


In Cluster 3 No • Yes 



0 1 2 3 4 5 

|4kiilog( Control Sample Count + 1 ) 



PKiilog( Control Sample Count + 1 ) 


Figure C.13: Plots of the transformed ChIP sample read counts against the transformed control 
sample read counts for all units in the Gml2878 cell for (a) Bcl3 and (b) Bclafl. Data from 
unenriched units are expected to locate around the 45 degree dashed line. 


(a) 


(b) 


Cell: Both , TF: Max 


Cell: Both , TF: Usfl 


In Cluster 2 No • Yes 


In Cluster 2 No • Yes 


^ 6 - 

+ 



I I I I I I 

0 12 3 4 5 


Pkiilog( Control Sample Count + 1 ) 



0 1 2 3 4 5 

Pkiilog( Control Sample Count + 1 ) 


Figure C.14: Plots of the transformed ChIP sample read counts against the transformed control 
sample read counts for all units in both Gml2878 and K562 cells for (a) Max and (b) Usfl. Data 
from unenriched units are expected to locate around the 45 degree dashed line. 
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Figure C.15: 


Estimated enrichment probability for each of the 90 clusters identified by MClust. 
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Condition 


Figure C.16: The range of the estimated parameters ^kis among the different replicates under the 
same experimental condition for the +9.5 composite element data in Section 4.2. 
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Figure C.17: The range of the estimated parameters aku among the different replicates under the 
same experimental condition for the +9.5 composite element data in Section 4.2. For a subset of the 
replicates, the data for an individual state can be under-dispersed, resulting in a negative value for 
the estimated size parameter in the negative binomial distribution. In that case, we set crkis = 100. 
Although our model do not fully capture the under-dispersion patterns, the large variations in 
for a subset of the experimental conditions suggest that assuming replicate-specific distributions 
is quite necessary. 
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Figure C.18: Enrichment states provided by the ENCODE peak profiles. Seven empty rows 
corresponding to the TEs that lack ENCODE p^ltk profiles. Note that only a small percentage of 
the composite elements which harbor the canonical GATA binding site are identified as enriched 
for GATA family transcription factors. This suggests that the ENCODE peak profiles can be 
conservative. 
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Figure C.19: (a, b) ChIP sample read counts against control sample read counts for one replicate 
with K562-Yyl. Enrichment status are annotated by (a) the ENCODE peak profiles and (b) 
MBASIC prediction. 
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Table C.9: Enriched cell type-TF combination for each cluster in the TF enrichment network 
analysis of Section 4.1 of the main text. TFs with estimated enrichment probability > 95% are 
listed for each cluster._ 


Cluster 

# of 
Loci 

Common TF 

Gml2878 Specific 

K562 Specific 

1 

34 

Bcl3, Max, Spl, Tafl, 
Zbtb33 

Atf3, Etsl, Jund, Nrfl, 
Pol24h8, Sin3ak20, Tr4 

Egrl, Pul, Rad21, Usf2 

2 

317 

Bcl3, Chd2, Max, 

Pol24h8, Spl, Tafl 

Atf3, Etsl, Nrfl, 

Sin3ak20, Usf2 

Bclafl, Egrl, Pul, Smc3, 
Tbp, Zbtb33 

3 

490 

Etsl, Pol2, Pol24h8, 
Sin3ak20, Tafl, Tbp, Usfl 

Atf3, Chd2, Nrfl, Usf2 

Bcl3, Bclafl, Max, Spl 

4 

555 

Bcl3, Bclafl, Chd2, Pol2, 
Pol24h8, Spl, Tafl, Tbp 

Atf3, Etsl, Nrfl, 

Sin3ak20, Six5, Smc3 

Pul 

5 

428 

Bcl3, Chd2, Etsl, 

Pol24h8, Spl, Tafl 

Atf3, Ctcf, Nrfl, Rad21, 
Sin3ak20 

Bclafl, Pol2, Smc3, Tbp 

6 

729 

Bcl3, Chd2, Etsl, Pol2, 
Pol24h8, Sin3ak20, Smc3, 
Spl, Tafl 

Atf3, Nrfl 

Bclafl, Egrl, Tbp 

7 

391 



Bcl3, Bclafl, Chd2, Egrl, 
Etsl, Smc3, Spl 

8 

133 

Ctcf 

Atf3, Chd2, Rad21 

Smc3, Srf 

9 

146 

Ctcf 


Bclafl, Pol24h8, Smc3, 
Tbp 

10 

469 

Pol2, Pol24h8, Tafl 

Smc3 

Bclafl, Etsl, Sin3ak20, 
Tbp 

11 

440 

Gabp, Pol24h8, Tafl 

Etsl, Smc3 

Pol2 

12 

184 


Bcl3, Chd2, Etsl, Nrfl, 
Pol24h8, Tafl 


13 

277 

Chd2, Pol2, Pol24h8, Spl, 
Tafl, Tbp 

Smc3 

Cfos 

14 

156 

Chd2, Pol2, Pol24h8, Tbp, 
Zbtb33 

Etsl, Smc3, Tafl 


15 

412 

Pol2, Pol24h8 



16 

327 


Pol24h8 


17 

213 

Usfl 

Atf3, Usf2 

Max 

18 

241 

Six5 

Etsl, Smc3 


19 

187 

Chd2, Spl 


Cfos 

20 

222 



Etsl 

21 

385 



Pol24h8 

22 

343 

Ctcf 


Smc3 

23 

449 


Nrfl, Smc3 


24 

1674 
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Table C.IO: Annotations for +9.5 Element-like loci in 5pl (2Kb upstream of tran¬ 
scription start site (TSS)), 5p2 (2Kb to 10Kb upstream of TSS) and intronic regions. 
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Continued on the next page. 






NM_130837 OPAl chr3 + 193310932 193415599 intron 67680 193378528 193378697 0.659 

NM_130836 OPAl chr3 + 193310932 193415599 intron 67680 193378528 193378697 0.659 

NM_130835 OPAl chr3 + 193310932 193415599 intron 67680 193378528 193378697 0.659 

NM_130833 OPAl chr3 + 193310932 193415599 intron 67680 193378528 193378697 0.659 

NM_001004342 TRIM67 chrl + 231298673 231357314 5p2 -2591 231295998 231296167 0.659 




..... ..... , 

ddddddddddddddddddddddddddddddddddddddddddddd 


^ iM O 00 CO CO 00 

: f~) cn ^ rv] rsi rci 



^ ^ CO CO CO 
_I O CN Q Q 


oSS<0CO^2C5C^S^^^00 00b-SG5'-(.-H.-H'i^C0^t^^t>OCN^^C0C05900 00 00b-P«MC0g[^[^ 


^^§S§Ss°Ss22 




5§§QOQof^ioin50'^i^’-'o^^ 


2S<^^fO^<^^C5l050co^-coco-^^>cogSo^ggSSSScocoSot^^c^c^§§w<^^c^^-'^ooo^(M(M 
coco' 'csjiCO'^^^'-Hi-Hi-Hi-HiOO^^T-H I I (M’^CMd OqlOiOiCOi-lOOit-t- 


fi fl C Cl c 

OOOClCNOiMO 

sis-is-aas-iD.*- 
^ ^ ^ ^ 


C C Cl C G C 

0 0 O O O 0 


Cl C 
O IM O 

ii 0. h 


G fl C Cl c C 

O 0 O O O 0 

S-I S-i S-i ^ S-I S-i 


CCCGCICICIC' 


CCflflClClClflClCl 


Cl Cl C Cl Cl C 

OOOiMClOOO 


C C Cl Cl C G 

0 0 0 O O 0 

S-i bri S-i S-i S-I S-i 


Cl Cl c 
O CN O O 

Si G, !- 


flclclclclflflclcic' 


^ h h 

^ Cl Cl 


S||*”||SSi|goo!S|S°°»Sffife|||^5!2|3f^f^ggt2®®®g 

5SiSSS2^:eSg§222S°22^gSSg§||^:S5giS??gg^^????§° 


^ d O O ^ _, 

oSSooS^’-'t^rnrnrnt^t^'-irr;'^’-''-'f^GO'^?T8^5(V5^ 
(MS^lOin2j:^COt-^^^lOlOb-SoO<MCN-^COC5Sgg§'-'';0--H^2 


K_ b- CO CO 

55 o g ^ ^ 

lO CO ^ 


COCN^g^OOOO^^CiCOCOCO^ 
, CN CN ^ O 


^ ^ O IM IM 

^ il 50 O I> 

g oi oi 

^ ^ 2 2 


Oi S ^ <0^ I 

K_ CO CO ^ . 

O ^ ^ lO I 

^ CO CO lf2 1 

2 2 3 3 ; 


O IM 
of CO 
O 


oooo222oocoS'^^^05i5oooK^SooOO'-(T-H2}2iocoi^:::;coio50i-H5SocoK?i::i: 
C^^SS^"^"^^l^St>OOQOCOlOOO*"g§§OCO.^^i2 


j^b-00CNT-H^C5O:^22 


””§§§S2§?5^”" 


+ + ' + 






+ + 


iA-i pA-i ^ f-, pA-i f-t ^ P^ P^ pA-i ^ ^ f-, pi_^ f-i ^ f-, f-, pi_^ f-i pj^ p^ pi_^ pA_| f-i ^ f-, pi_^ pA_| (^ pA_l pj^ p^..^ — pin i.i-1 i^ iA-i 1.I-4 

oo'§'§uy'§'§ouy'§'§'§u'^'§'§'§U'§oooo'§'§'§uot3l3ljl3wuy'5oy'§yoo 


SZZ§Cw,,^^SSS<<o,J«mm3J^«gggg2S 

i^^2Ss§^^s888SoSs8S|gsss|<<|: 


CO CO 

S:2 S ^ < < Q Q 

2 5 S: c5 o ^ 


m j 3 ^ 


I (Y*i r T ^ 0^ 

z<;(;h(;hCQm‘t;c4&H25S^c<<i 
X^-SStfPioSNoJ^^QXX 


I-H CO CO OCi 

^ofiOCOcNiO^COCOb-OC500i-HCOI>'-HOOiOi-iOiOS^i0^d‘®’-'°0'^'^^-‘^‘^'^^'^^°0'®'0'^ 

OlO'^O^GOri'M'-H^COlOi-lb-CNCOCOCOCOb-^CO^COb-^iMb-OCib-COtMC^COGOlOO^OCilOlOC^ 

iMOOC^fO«2ki05C^CNOOOOCO'^COI>lOb-iOiOb.l02^^2cOCO^^'-l'^'-HCOlOCOO:2cNCO'^b- 

OCOlO'^Of^^’-''^^-^^f^OOl-^oflO^^^C^^^y:)lO^O(MXCOOOC^^lOOCOCO^-^0^'^^^^^A^^>‘^^'^^ 

iMI>0'-HcoI>lJ'^>’-'Oof'^Ob-00'-iOT-iO(MOdc0^d'-HCOQO(M(Mi-HOOOCOOOdaiOOiO 

0 ^ 000 ^q 000 ^'-i<N 00000 CN 000 qt-i 050 '-''-i 0000000 '-Hq'-i 0 'M'-h 

SSSS2s3SSSSSSSSSSSSSpCS2SS3SSSSSSSSSSS2SSSS 

XXXXXXSXXXXXXXXXXXXXXXSXXSXXXXXXXXXXXSXXXX 
X XX X 


b- CO of 
O CO lO 
I-H Oi (M 
lO O CO 
CM CO CO 


sss 

XXX 


61 


Continued on the next page. 





NM_030984 TBXASl chr7 + 139528951 139720123 intron 24336 139553203 139553372 0.591 

NM_001166253 TBXASl chr7 + 139528951 139720123 intron 24336 139553203 139553372 0.591 

NM_001061 TBXASl chr7 + 139528951 139720123 intron 24336 139553203 139553372 0.591 

NR_029394 TBXASl chr7 + 139528951 139720123 intron 24336 139553203 139553372 0.591 

NM_173542 PLBD2 chrl2 + 113796370 113827458 intron 20911 113817197 113817366 0.591 
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